ABSTRACT



An automated design system for VLIW processors explores a parameterized design space to assist in identifying candidate processor designs that satisfy desired design constraints, such as processor cost and performance. A VLIW synthesis process takes as input a specification of processor parameters and synthesizes a datapath specification, an instruction format design, and a control path specification. The synthesis process also extracts a machine description suitable to re-target a compiler. The re-targeted compiler generates operation issue statistics for an application program or set of programs. Using these statistics, a procedure for searching the design space can extract internal resource utilization information that is used to determine new candidate processors for evaluation.
AUTOMATED DESIGN OF PROCESSOR SYSTEMS USING FEEDBACK FROM
INTERNAL MEASUREMENTS OF CANDIDATE SYSTEMS

RELATED APPLICATION DATA

This patent application is related to the following co-pending U.S. Patent applications, commonly assigned and filed concurrently with this application:

U.S. patent application Ser. No. [illegible], entitled AUTOMATIC DESIGN OF PROCESSOR DATAPATHS, by Shail Aditya Gupta and Bantwal Ramakrishna Rau;

U.S. patent application Ser. No. 09/378,293, entitled AUTOMATIC DESIGN OF VLIW INSTRUCTION FORMATS, by Shail Aditya Gupta, Bantwal Ramakrishna Rau, Richard Craig Johnson, and Michael S. Schlansker;

U.S. patent application Ser. No. 09/378,394, entitled AUTOMATED DESIGN OF PROCESSOR INSTRUCTION UNITS, by Shail Aditya Gupta and Bantwal Ramakrishna Rau;

U.S. patent application Ser. No. 09/378,298, entitled PROGRAMMATIC SYNTHESIS OF PROCESSOR ELEMENT ARRAYS by Robert S. Schreiber, Bantwal Ramakrishna Rau, Shail Aditya Gupta, Vinod Kumar Kathail, and Sadun Anik;

U.S. patent application Ser. No. 09/378,395, entitled AUTOMATIC DESIGN OF VLIW PROCESSORS, by Shail Aditya Gupta, Bantwal Ramakrishna Rau, Vinod Kumar Kathail, and Michael S. Schlansker; and

U.S. patent application Ser. No. 09/378,601, entitled PROGRAMMATIC SYNTHESIS OF A MACHINE DESCRIPTION FOR RETARGETING A COMPILER, by Shail Aditya Gupta.

The above patent applications are hereby incorporated by reference.

TECHNICAL FIELD

The invention relates to the automated design of electronic systems, and in particular, to the automated design of Explicitly Parallel Instruction Computing (EPIC) architectures.

BACKGROUND

As the workstation and personal computer markets are rapidly converging on a small number of similar architectures, the embedded systems market is enjoying an explosion of architectural diversity. This diversity is driven by widely-varying demands on processor performance and power consumption, and is propelled by the possibility of optimizing architectures for particular application domains. Designers of these application specific instruction-set processors (ASIPs) must make tradeoffs between cost, performance, and power consumption. In many instances, the demands for a particular application can be well served by using a processor having an Explicitly Parallel Instruction Computing (EPIC) architecture. One form of EPIC processor is a very long instruction word (VLIW) processor.

VLIW processors exploit instruction-level parallelism (ILP) by issuing several operations per instruction to multiple functional units. A VLIW processor design specifies the processor's datapath and control path. The datapath includes the functional units for executing operations, registers for storing the inputs and outputs of the operations, and the interconnect for transferring data between the functional units and registers. The control path provides control signals to the control ports in the datapath based on a program, which is either read from memory or hardwired into the control logic.

EPIC processors support explicit instruction level parallelism. EPIC processors may also support additional features to improve processor performance and efficiency. These features include hardware support for speculation, predication, and data speculation. Other features include rotating registers and special branch instructions for executing software pipelines with enhanced efficiency. Throughout this document, references to a VLIW processor are intended to broadly encompass EPIC processors.

VLIW processors can be grouped into two categories: "programmable" and "non-programmable". Programmable VLIW processors are processors that can be programmed by users. The instruction set of these processors is visible to the programmer/compiler so that a programmer can write programs either directly in the machine code or in a high level language that is then compiled to the machine code. These processors are connected to a "program memory" that is used to store the program to be executed. Typically, the program memory is part of the memory system that stores both data and programs, and it is implemented using RAM (random access memory) that can be both read and written.

Non-programmable VLIW processors are designed to execute a specific application or a fixed set of applications. The primary difference between programmable and non-programmable processors lies in the way that the control logic is implemented. In programmable processors, the control logic includes hardware components for fetching user specified instructions from memory, issuing these instructions for execution, and decoding the instructions. In non-programmable processors, the control logic does not accommodate user modified programs. Instead, the control logic is specifically adapted for a particular program. In a microprogram approach, the program is represented as a series of wide words stored in memory. The control logic reads the program words, decodes them, and issues them to control ports of the datapath. This type of processor is non-programmable in implementations that do not allow the user to modify the program. In a hard-wired approach, the program is hard-wired in control logic, such as a finite state machine, that issues control signals to the processor's datapath.

In designing a VLIW processor, a number of cost/performance trade-offs need to be made. Each of these trade-offs can have a substantial impact on the overall system cost and performance. Unfortunately, designing a VLIW processor today is a fairly cumbersome manual process which must carefully weigh cost and performance tradeoffs in the light of resource sharing and timing constraints of the given micro-architecture. Optimizations and customizations of the processor, if any, with respect to a set of applications or an application domain must also be determined and applied manually.

One research effort has focused on the automated design of ASIPs based on a special type of processor architecture called the Transport Triggered Architecture (TTA). See MOVE citation. Automated design of a processor is particularly important for ASIPs because it makes it possible to evaluate a number of different processor configurations in a process called "design space exploration." Design space exploration refers to a programmatic search procedure used
to investigate some or all possible processor designs in a parameterized space in an automated fashion. The design space of even a simple processor model is large, and exhaustive search strategies are of little practical use. Practical schemes can explore only a small subset of the total parameterized space of processors.

The published work on TTA processors cited above outlines a method for automated design space exploration of candidate processors based on their cost (e.g., chip area, number of pins, power dissipation and code size) and performance (i.e. the inverse of execution time). This approach is limited because it does not incorporate statistics about internal resource usage of system components in the design exploration process.

SUMMARY OF THE INVENTION

The invention provides a programmatic system and method for exploring the design space of a VLIW computer. The term "programmatic" refers to a system or method implemented in a program module or set of program modules. The system and method allow system designers to evaluate many candidate processor designs in an automated fashion.

One aspect of the invention is a programmatic method for designing a VLIW processor using feedback about internal resource utilization. This method reads a specification of a candidate VLIW processor, which describes a specific instance of a parameterized processor design. It then obtains internal resource usage statistics for the candidate processor. For example, in one implementation, a VLIW synthesis process programmatically generates a hardware description of the processor. A compiler, re-targeted to the candidate processor, generates operation issue statistics for an application program to be executed in the candidate processor. The operation issue statistics provide information about how the candidate processor issues operations during execution of the program, such as the quantity, frequency, and timing of the issuance of an operation or set of operations. For example, the statistics may specify how often selected operations are issued concurrently. By mapping these statistics to internal resources such as hardware macrocells, register ports or instruction fields, the design method determines how the processor's operations or hardware components are used during execution of the program. Each operation in a processor's input specification maps to a functional unit that executes it, and the register ports and instruction fields it utilizes when executed in the processor.

Based on these internal resource usage statistics, the method determines a new candidate processor or set of processors and provides an input specification for each new processor. The method then programmatically generates a description of the new candidate processor in a hardware description language from the new specification. It is not necessary to synthesize a complete detailed structural description of each new candidate processor to evaluate it during the design space exploration process. To expedite the design space exploration, it is possible to evaluate a candidate processor based on only a partial synthesis of its structural design or based on an abstract, non-structural instruction set architecture specification. Depending on the criteria used to evaluate a candidate, it is possible to evaluate a candidate processor based on the description of the new candidate processor, or based on a high-level structural processor design synthesized from the description. The process of specifying and evaluating candidate processors may be repeated to explore the parameterized design space in search of candidate processors that satisfy the design objectives, such as execution speed, chip area, circuit complexity, power consumption, etc.

Another aspect of the invention is a programmatic method for designing a VLIW processor using abstract, non-structural parameters to specify a candidate processor or set of potential candidates. Like the method summarized above, this method selects a new candidate or candidates based on information derived from a previous candidate processor, but this information may be an external metric such as cost or performance or an internal metric such as internal resource usage. The new candidate processor is specified in terms of non-structural parameters, namely, processor operations or instruction level parallelism constraints among the processor's operations.

Another aspect of the invention is a programmatic method for designing a VLIW processor based on an evaluation of a prior candidate processor or set of processors, optionally including an evaluation based on the synthesized instruction format of a prior candidate. In addition to providing a hardware description of the VLIW processor, this method designs its instruction format. In some cases, the instruction format may be used to create a hardware description of the processor's control logic. In addition, the instruction format may be used to evaluate the static and dynamic code size of an application program to be executed on the candidate processor.

One implementation of the invention is an automated design system comprising a set of program modules. The system includes components for designing a VLIW processor and evaluating its cost and performance. The design components include a datapath synthesizer, instruction format designer, and control path synthesizer. The datapath synthesizer reads an abstract instruction set architecture specification, including an opcode repertoire, and instruction level parallelism constraints on operations in the opcode repertoire, and programmatically generates a datapath specification from a macrocell library. The datapath includes instances of functional units, register files and an interconnect between data ports of the functional units and register files.

The instruction format designer programmatically generates an instruction format from the datapath specification and the abstract instruction set architecture specification. This instruction format includes instruction templates representing VLIW instructions executable in the VLIW processor, instruction fields of each of the templates, and bit positions and encodings for the instruction fields. The control path synthesizer programmatically generates a control path specification from the instruction format and datapath specification.

The system also includes a program module called the MDES extractor that extracts a machine description suitable to re-target a compiler. The machine description, referred to as "MDES," provides resource conflict constraints derived from a traversal of a structural description of the processor's datapath. It also provides a specification of the input/output format of the processor's operations. Parameterized by this MDES, a re-targetable compiler generates operation issue statistics for a program executing on a candidate processor.

The components for evaluating the processor include a cost evaluator for evaluating cost of a synthesized VLIW processor, and a performance evaluator for evaluating performance of an application program executed on the synthesized VLIW processor. The cost evaluator determines a processor's cost in terms of the chip area that it occupies,
while the performance evaluator determines its performance in terms of how fast it executes a specified program. Other criteria for evaluating a processor's cost/performance may be used as well. For example, the system may evaluate a processor based on the power it consumes by summing the power consumed by each of the hardware macrocells in its design. Also, since internal usage information is available, power consumption can be estimated based on how frequently each macrocell is used for a particular application program.

Finally, the system includes a spacewalker for selecting a candidate VLIW processor for synthesis by the datapath synthesizer, control path synthesizer and instruction format designer, based on the cost and performance of the synthesized VLIW processor. The spacewalker may operate in conjunction with procedures for extracting internal resource utilization information from candidate processors. These procedures translate resource usage information into processor parameters used to specify a new candidate processor.

Further features of the invention will become apparent from the following detailed description and accompanying drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a design flow diagram illustrating a design space exploration process for VLIW processors.

FIG. 2 illustrates an example of a structural processor parameterization.

FIG. 3 illustrates the design flow of a VLIW processor in an implementation of the invention.

FIG. 4 shows an example of how the system might assign functional units and allocate register file ports based on a high level specification of operation groups and resource sharing constraints among these groups.

FIG. 5 is a flow diagram illustrating an implementation of the datapath synthesis process shown in FIG. 3.

FIG. 6 is an example of a data structure used to represent resource sharing constraints in the datapath synthesis process.

FIG. 7 graphically depicts an input specification and its corresponding datapath design to illustrate that the datapath synthesis process produces a structural description of the datapath from a specification of its operations and the desired instruction level parallelism among them.

FIG. 8 illustrates an MDES extractor module that extracts a machine description for re-targeting a compiler from an abstract ISA specification and structural datapath specification of a processor.

FIG. 9 illustrates an operation hierarchy data structure that a re-targetable compiler uses to bind operations within a program from the semantic level to the architectural level.

FIG. 10 is a flow diagram illustrating the operation of an MDES extractor.

FIG. 11 is an example of a processor datapath.

FIGS. 12A-B are two distinct operation format and reservation table combinations for the ALU shown in FIG. 11.

FIG. 13 is a reservation table for a SQRT operation.

FIG. 14 is a diagram illustrating the instruction format design flow in an implementation of the invention.

FIG. 15 is a diagram of an instruction format data structure.

FIG. 16 is a diagram illustrating an example of an instruction template, which represents a possible VLIW instruction format.

FIG. 17 shows an example illustrating how the instruction format design system uses the ILP constraints on operations to build a concurrency matrix data structure and then select instruction templates based on the concurrency relationships expressed in the data structure.

FIG. 18 is a diagram illustrating another example of how the instruction format design system uses ILP constraints specified in the input to organize the operations into VLIW instruction templates.

FIG. 19 is a diagram illustrating the process of selecting instruction templates.

FIG. 20 is a flow diagram illustrating a process of selecting custom instruction templates from operation issue statistics generated from a compiler.

FIG. 21 illustrates a control path design system.

FIG. 22 illustrates an example of a processor control path and data path.

FIG. 23 illustrates the operation of a shift network for the control path design of FIG. 22 for sequential instruction fetching.

FIG. 24 illustrates the operation of the shift network for branch target instruction fetching.

FIG. 25 is a flow diagram illustrating the operation of a software implemented control path design system.

FIG. 26 is a diagram illustrating aspects of the design of an instruction register shift network.

DETAILED DESCRIPTION

1.0 Introduction

As summarized above, the invention is implemented in a programmatic system for automating the design of embedded systems consisting of processors, registers, memories, and other requisite computing structures. The programmatic design system is comprised of the following elements: a processor parameterization, a synthesis procedure, cost and performance procedures, and search procedures for exploring a processor design space.

The processor parameterization expresses each machine configuration in terms of a set of input parameters. The synthesis procedure reads the input parameters and programmatically creates a hardware description of a processor based on them. For each processor, the cost and performance procedures evaluate the cost of the configuration and the performance of the configuration running an application program.

Using these elements, the system executes a search procedure that investigates the parameterized space in order to find a candidate processor or set of processors that are suited to the needs of the application. Because an exhaustive search of the total parameterized space is impractical, the search procedure makes a more efficient search of a subset of the space. The search path may be computed using information about previously explored systems to develop a set of attractive candidate systems for which synthesis, cost evaluation, and performance evaluation are carried out. This process is iterated until systems of adequate quality are identified. We call procedures that efficiently walk the parameterized space of processors in order to identify one or more especially attractive systems "spacewalkers".

To enhance search efficiency through the design space, the system can execute a search procedure that incorporates more detailed knowledge of the internal usage of system components. More detailed measurement of system internals such as: the degree of utilization of specific components,
whether components are sometimes or never used simultaneously, or detailed measurements of the operation repertoire required to execute the embedded application, are all useful for determining well-chosen candidate systems in the spacewalking process. Using this information, spacewalking heuristics can identify more attractive subsequent system configurations from previously investigated system configurations. This information allows spacewalkers to identify a more cost-effective processor system after evaluating a more carefully chosen and smaller number of candidate systems. The result is improved search efficiency and often a superior final system.

FIG. 1 is a diagram illustrating the design flow of a programmatic system for designing VLIW processors. As shown in FIG. 1, this system generates information used to specify a new candidate processor design based on information about a prior design. At the start of design space exploration, the system begins with an initial candidate (or set of candidates) called the "seed." In one approach, the system begins with an inexpensive processor design in terms of cost (e.g., chip area) and then selects new candidates by modifying parameters in an attempt to improve processor performance. For example, the system may start with a single issue processor (issues one operation for execution at a time) that satisfies the minimum requirements of the application program. Then, the system increases instruction level parallelism to improve performance at the expense of potentially adding additional functional units. In an alternative approach, the system may begin with an expensive processor design, and then reduce instruction level parallelism to reduce cost.

The system specifies a candidate processor in terms of processor parameters 20 from a parameterized processor space. Using these parameters as an input specification, a VLIW synthesis process 22 creates a processor instance. The processor instance may include a hardware description 24 (e.g., in VHDL) of the processor's datapath and control path. It may also include the processor's instruction format 26 and a machine description (MDES) 28 suitable to map a re-targetable compiler 30 to the candidate processor.

The re-targetable compiler schedules an application program 32 and generates a number of statistics files referred to as the operation issue statistics 34. The operation issue statistics provide histograms indicating the static and dynamic opcode usage of the application program. Each of the opcodes is mapped to an "operation set" in the processor specification. As explained below, operation sets provide a convenient construct for specifying the processor's opcode repertoire in terms of sets of operations that share attributes as opposed to individual operations. The operation sets are mapped to register file types in the Input/Output (I/O) formats of an opcode. This correspondence between opcodes, on one hand, and operation sets and register files, on the other, enables the system to generate dynamic and static usage of each operation set and the associated register files.

The system includes a program or programs that implement search heuristics to select candidate processor designs for evaluation. These search heuristics use information about a candidate processor's cost and performance to select other candidates. A performance evaluator 36 computes the performance of a candidate processor in terms of execution cycles. A cost evaluator 38 evaluates the cost of a candidate processor based on costs of individual components in the hardware description, which lists instances of macrocells and their corresponding areas, power consumption, etc. While the cost or performance data may be used to select new candidate processors, internal resource usage information may be used to refine or focus the search more effectively.

With additional information from the VLIW synthesis outputs, the operation issue statistics may be used to determine internal resource usage within the candidate processor. For example, one procedure determines register file utilization, and translates the utilization information into processor parameters used to specify a new candidate processor. Another procedure 42 translates macrocell utilization into processor parameters. Yet another procedure 44 translates instruction field usage into processor parameters.

Using the cost, performance, and/or internal resource utilization information, a search procedure 50 selects a new candidate processor. To accomplish this, it provides a new input specification in terms of the processor parameters. As described below, these parameters may be structural, non-structural or a combination of both.

The following sections elaborate on aspects of the design flow. Sections 2-5 provide an overview of aspects of the system germane to spacewalking. Section 6 describes an implementation of the VLIW synthesis process shown in FIG. 1. Finally, Section 7 provides some examples of spacewalking procedures.

2.0 Processor Parameterization

The design space is defined as a set of processor parameters 20 that are used to specify candidate processors. Each candidate processor has a corresponding set of parameter values. The user or a program module, such as the spacewalker, can specify a candidate processor by providing a processor specification containing parameters selected from the design space. Using these parameters, a VLIW synthesis process 22 generates an instance of the candidate processor.

The spacewalker can use a hardware structural processor parameterization 52, an abstract processor parameterization 54, or a combination of both.

2.1. Hardware Structural Processor Parameterization

A hardware structural processor parameterization represents the space of processors in a manner that closely mirrors the physical hardware. An example hardware structural processor parameterization is shown in FIG. 2. This form of parameterization is defined in terms of a library of functional unit and interconnect components. A processor is described by specifying instances of these library components and by describing how component instances are interconnected.

Consider the example shown in FIG. 2. The structural parameterization uses two basic types of components called functional units and buses. In this example, registers and memories are considered to be special cases of functional units. Each instance of a functional unit is marked "F(<first parameter>, <second parameter>)". The first parameter represents the type of the functional unit while the second parameter represents an instance index that uniquely identifies each functional unit.

Functional unit types are "+-", "*", "mem2k", and "reg22"; these represent an adder/subtractor, multiplier, memory with 2 k size, and two input port/two output port register file. Buses are labeled B1 . . . B4. Each dark dot on a bus represents a connection between a functional unit port and the bus that crosses the port at the dot. This stylized structure parameterizes a broad class of interesting processing configurations.
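
The functional-unit and bus notation described for FIG. 2 can be captured in a small data structure. The following Python sketch is illustrative only: the class names, method names, port names, and the example topology are assumptions introduced here and are not taken from the patent or from FIG. 2. It records each F(<type>, <index>) instance and each bus-to-port connection (each "dark dot").

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class FunctionalUnit:
    """One F(<type>, <index>) instance; registers and memories count as units."""
    kind: str    # first parameter: unit type, e.g. "+-", "*", "mem2k", "reg22"
    index: int   # second parameter: unique instance index

    def label(self) -> str:
        return f"F({self.kind}, {self.index})"


@dataclass
class StructuralProcessor:
    """Hypothetical container for unit instances, buses, and their connections."""
    units: dict = field(default_factory=dict)   # index -> FunctionalUnit
    buses: dict = field(default_factory=dict)   # bus name -> set of (unit index, port)

    def add_unit(self, kind: str, index: int) -> FunctionalUnit:
        if index in self.units:
            raise ValueError(f"instance index {index} is already in use")
        unit = FunctionalUnit(kind, index)
        self.units[index] = unit
        return unit

    def connect(self, bus: str, unit_index: int, port: str) -> None:
        """Record one 'dark dot': a unit port attached to a bus."""
        if unit_index not in self.units:
            raise KeyError(f"unknown functional unit index {unit_index}")
        self.buses.setdefault(bus, set()).add((unit_index, port))

    def buses_of(self, unit_index: int) -> list:
        """Names of the buses that touch any port of the given unit."""
        return sorted(
            name for name, dots in self.buses.items()
            if any(idx == unit_index for idx, _ in dots)
        )


# Illustrative topology only; it does not reproduce FIG. 2.
proc = StructuralProcessor()
proc.add_unit("+-", 1)
proc.add_unit("*", 2)
proc.add_unit("reg22", 3)
proc.connect("B1", 1, "in0")
proc.connect("B1", 3, "out0")
proc.connect("B2", 2, "in0")
proc.connect("B2", 3, "out1")
proc.connect("B3", 1, "out")
proc.connect("B3", 3, "in0")

print(proc.units[2].label())   # F(*, 2)
print(proc.buses_of(3))        # ['B1', 'B2', 'B3']
```

Keeping units keyed by instance index mirrors the role of the second parameter as a unique identifier, and storing connections per bus makes it straightforward for a search procedure to add or remove a unit or an interconnect link as a single, local change.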
To implement a system based on these structural input parameters, one starts by defining a data structure to represent instances of functional units, and buses as well as connections between buses and functional units. This data structure provides a means for describing processor instances. Using this structure, a spacewalker can express candidate processor configurations. The synthesis process can define actual hardware conforming to the parametric specifications. In addition, a cost procedure can evaluate the cost of the configuration. Finally, a compiler can compile an application program to this processor specification in order to determine performance.

To specify candidate processor designs, the spacewalker can use internal or external metrics to alter an existing processor configuration or create a new one. With respect to internal metrics, the spacewalker can use the internal usage information to identify underutilized functional units, interconnect hardware, register files/ports, etc. and remove them. Conversely, it can use this information to identify fully utilized structural components and add additional instances of them.

2.2. Abstract Non-Structural Parameterization

The use of a hardware structural processor parameterization may lead to certain problems in the automated design process. For instance, the spacewalking procedure may have to consider a large number of structurally distinct hardware solutions that are essentially equal with respect to both cost and performance. In addition, the hardware structural processor parameterization may be made in terms of a library of available hardware components rather than in terms of application requirements. This may require that the spacewalker be intimately tied to the actual synthesis approach including knowledge of actual components and rules for legal connection.

The use of an abstract, non-structural parameterization can assist in the design automation process. An abstract, non-structural parameterization is a parameterization from which the processor's structure is not readily apparent. The processor's structure is determined by a synthesis procedure that reads the abstract non-structural processor parameterization and identifies a processor structure that conforms to

[...]

operation, chosen from the operations within its opset, might execute. Thus, the first operation group indicates that on each clock cycle either an addition or a subtraction might execute. If this list of 4 opsets were used as the complete abstract non-structural specification, it would indicate that one operation per operation group might be issued per clock cycle, where each operation is selected from the opgroup's opset.

Often the opgroup specification alone does not yield a processor with the desired cost/performance. Instruction level parallelism constraints, such as exclusion sets and concurrency sets, may be used to further specify the desired ILP of the processor. An exclusion set may be used to specify operations that the candidate processor must not issue concurrently. The synthesis process can then use this constraint to require the operations in the exclusion set to share a hardware resource in the processor. This form of an exclusion set may be used to reduce the amount of available parallelism, and thus, the cost of the processor. Alternatively, a concurrency set may be used to specify operations that the processor must be able to issue concurrently. This form of a concurrency set may be used to enhance performance by requiring the processor to issue frequently used operations concurrently.

The user or the spacewalker can define ILP constraints among operation groups. For example, each operation group can be uniquely indicated by its instance index. An exclusion is a tuple: (<first opgroup index>, <second opgroup index>) which indicates that the synthesis process may use less expensive hardware that does not allow the simultaneous execution of operations from the specified operation groups.

In order to reduce the cost of the synthesized hardware, one might specify additional exclusions such as (2,3), (2,4), (3,4). These exclusions indicate that opgroups 2 and 3 cannot issue operations simultaneously. Similar constraints have been placed on opgroups 2 and 4 as well as opgroups 3 and 4; thus opgroups 2, 3, and 4 are mutually exclusive and cannot issue operations simultaneously. Note that these statements are not structural statements; they do not directly describe the number of functional units nor do they describe how the functional units are interconnected. Rather, they describe application requirements needed to achieve a cer-

the abstract non-structural processor specification using a tain level of performance, A synthesis procedure is respon- 
hardware structure satisfying desired cost/performance con- 45 sible for defining a hardware structure, including functional 

straints. units and an interconnect, which is suitable for satisfying 

The VLIW synthesis process discussed in Section 6 these design constraints, 

generates a structural processor description from an abstract ^.3 Other Processor Parameters 
non-structural processor specification. This specincation 

uses operation sets (opsets) and operation groups (opgroups) 50 Many parameters may be used in the space walking pro- 

to help define processing requirements. As a specific cess. These parameters m ay either he stmgairAl nr abstract 

example, the following textual description illustrates how and min-structura l. In both cases, they are read as inputs to 

one might specify the desired operations of the processor the synthesis procedure and effect the resultant output pro- 

(e.g., its opcode repertoire): add__sub=»{+-}, mul={*}, and cessor. Pa rameters may include the size of re gis ter flles,j he 
mem={ld, st}. Each of these operation sets represents a set 55 size of memories, the width of hteral ficlHs; the chosen 

of operations that is potentially needed by an application means of encoding literals, etc. I ii a sp acewal ker implemg n- 

program that the processor is designed to execute. In tation using the VLIW synthesis pro«ss ot Section 6, the 

general, application programs need differing levels of hard- VLPV par ameters in cplude a register file specification. This 

ware support for operations within each operation set. The s pecifi c^liQn jrovides the type of re g ister files_(e.£., intege r, 

level of hardware support for an operation set is specified floating point. j)redicate. and bra nch tareep and the nu_a fier 

using an operation group. of registers of each type. It also pro videsTErwidttL of hteral^ 

Each operation group (opgroup) represents an instance of jejds in literal re^iste re (e.g./."memory Uterals, branch 

an operation set and is specified here as a tuple: (<operations Uterals, and integer data literals). 

set>, <instance id>). Ths following four operation groups 3 0 S acewalker Feedback 

might be used to specify a chosen level of hardware support: 65 " 

(add^ub,!), (add_sub,2), (mul3), (mem,4). Each opera- The space walking procedure uses information regarding 

tion group indicates that on each clock cycle a single the suitability of previously explored processors in order to 
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select a new candidate processor. After a candidate processor 
is selected, the synthesis process is capable of producing a 
wealth of detailed measurements about the candidate pro- 
cessor and its characteristics when it is used to execute the 
application. 5 
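
As an illustration only, a candidate processor in the abstract non-structural parameterization of Section 2.2 might be held in a data structure such as the following Python sketch. The names OPSETS, OPGROUPS, EXCLUSIONS, can_issue_together and max_issue_width are hypothetical and are not part of the described system; the sketch assumes that exclusions are stored as unordered pairs of opgroup indices.

```python
from itertools import combinations

# Operation sets from the Section 2.2 example (the opcode repertoire).
OPSETS = {
    "add_sub": {"+", "-"},
    "mul": {"*"},
    "mem": {"ld", "st"},
}

# Operation groups: instance index -> name of the operation set it instantiates.
OPGROUPS = {1: "add_sub", 2: "add_sub", 3: "mul", 4: "mem"}

# Pairwise exclusions between opgroup indices, as in the text.
EXCLUSIONS = {(2, 3), (2, 4), (3, 4)}


def excluded(a, b, exclusions=EXCLUSIONS):
    """True if opgroups a and b must not issue in the same cycle."""
    return (a, b) in exclusions or (b, a) in exclusions


def can_issue_together(groups, exclusions=EXCLUSIONS):
    """True if no two of the given opgroups are mutually exclusive."""
    distinct = sorted(set(groups))
    return not any(excluded(a, b, exclusions) for a, b in combinations(distinct, 2))


def max_issue_width(opgroups=OPGROUPS, exclusions=EXCLUSIONS):
    """Largest number of opgroups that may issue in one cycle (brute force)."""
    ids = sorted(opgroups)
    for k in range(len(ids), 0, -1):
        if any(can_issue_together(c, exclusions) for c in combinations(ids, k)):
            return k
    return 0
```

With the three exclusions of the example, at most two opgroups (opgroup 1 plus one of opgroups 2, 3, or 4) can issue in a cycle; with no exclusions, all four can.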

3.1. External Metrics 

External attributes of a processor include measurements
of its cost and its performance when executing a given
application. These two external measurements determine the
suitability of the candidate processor for accelerating the
given application. These metrics are considered as exter-
nally visible metrics because they are directly related to the
utility of the candidate processor. The cost and performance
of previously explored processors are used to help identify
attractive new processor designs. Table 1 below lists some
examples of system costs and the program that produces the
cost data.



TABLE 1

COST                                   PROGRAM

Code Size - ROM size (mm^2)            Linker
Macrocells - chip area (mm^2)          VLIW Synthesis
Register Files - chip area (mm^2)      VLIW Synthesis
Register Ports - chip area (mm^2)      VLIW Synthesis
Instruction Width - chip area (mm^2)   VLIW Synthesis



As reflected in Table 1, the cost evaluator may quantify
the cost in terms of the area that the corresponding hardware
component occupies on a chip. The linker determines the
code size, which directly corresponds to the size of the ROM
needed to store the program. The VLIW synthesis system
calculates the cost of the various structural components
of the processor instance. For instance, the synthesis system
computes the chip area occupied by hardware macrocell
instances by summing the area of the instances in the
synthesized processor design. Parameterized cost functions
can be used to determine the cost in area of components such
as register files, buffers, logic arrays, etc. For example, a cost
function may define the cost of a register file as a function
of its input ports, output ports, data width, and number of
registers.
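
One possible form of such a parameterized cost function is sketched below in Python. The formula is an assumption made purely for illustration (each bit cell is taken to grow linearly in both dimensions with the total number of ports); the text does not give the function actually used, and the constants cell_area and port_pitch are arbitrary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegisterFile:
    input_ports: int
    output_ports: int
    data_width: int      # bits per register
    num_registers: int


def register_file_area(rf, cell_area=1.0, port_pitch=0.5):
    """Hypothetical area model in arbitrary units.

    Every port is assumed to add one word line and one bit line to each
    bit cell, so the side of a cell grows linearly with the port count
    and the cell area grows quadratically.
    """
    ports = rf.input_ports + rf.output_ports
    side = 1.0 + port_pitch * ports
    bits = rf.data_width * rf.num_registers
    return cell_area * bits * side * side
```

Under this assumed model, doubling the port count of a 32 x 32-bit register file from 3 to 6 raises the estimate from 6400 to 16384 units, which is the kind of sensitivity a spacewalker could use when deciding whether a rarely used port is worth its cost.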

The performance of a candidate VLIW processor may be
measured as the execution time (in cycles) of the target
application program. The re-targetable compiler generates a
measure of the execution time.

As noted above, other criteria, such as power
consumption, may be used either alone or in combination
with area and execution time to evaluate the merits of a
candidate processor.
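
Section 4.0 defines Pareto optimality in terms of these two external metrics. A minimal Python sketch of that comparison follows, assuming cost is a single number and performance is measured in cycles, so that fewer cycles means greater performance. Design, dominates and pareto_front are hypothetical names used only for this illustration.

```python
from typing import NamedTuple


class Design(NamedTuple):
    name: str
    cost: float     # e.g. total chip area
    cycles: int     # execution time; fewer cycles means greater performance


def dominates(q, p):
    """True if Q eclipses P under conditions 1) or 2) of Section 4.0."""
    cheaper_and_no_slower = q.cost < p.cost and q.cycles <= p.cycles
    faster_and_no_costlier = q.cycles < p.cycles and q.cost <= p.cost
    return cheaper_and_no_slower or faster_and_no_costlier


def pareto_front(designs):
    """Return the designs that no other explored design dominates."""
    return [p for p in designs if not any(dominates(q, p) for q in designs)]
```

The quadratic scan is adequate for the small number of candidates a spacewalker actually evaluates; a design that ties another on both metrics is kept, since neither condition holds.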

3.2. Internal Metrics 

A broad variety of internal metrics are also helpful in the
spacewalking process. By using internal metrics as feedback
to the spacewalker, one can improve the efficiency of the
spacewalking process. A number of examples of internal
metrics are listed below:

1. The statistical frequency of usage of a processor
component indicating how often a component is used. This
can be used to help delete rarely used components or add
new instances of highly utilized components. Operation
issue statistics provide a measure of the dynamic and static
opcode usage. Since the VLIW synthesis process maps
hardware components, e.g., functional units and register file
ports, to opcodes, these statistics also provide a measure-
ment of the usage of hardware components.



The VLIW synthesis generates an output report that maps
operations to the various hardware resources they utilize.
For each register file, the synthesis process specifies the
number of input and output ports, and the operations
requesting each port. It also specifies the functional unit
macrocells and the operations covered by each macrocell.
The instruction format synthesizer provides the maximum
and minimum instruction size. It also specifies, for each
instruction template, the total bit width and the bit width
requirements of various operations that may be issued
concurrently within the instruction.

2. Statistics measuring the frequency with which
operations are used simultaneously. The operation issue
statistics from the re-targetable compiler indicate such
opcode usage. These statistics can be used to add exclusions
among opgroups supporting operations that are not issued
concurrently. The synthesis procedure can then create less
costly hardware that is unable to concurrently execute
operations from such exclusive operation groups.

3. Statistics measuring the number of registers or memory
cells which are in use at a given time, etc. The spacewalker
can use these statistics to generate machines having fewer
registers or smaller memories.

4. Statistics that break down processor cost on a compo- 
nent by component basis can be used to isolate expensive 
components, which are not well used. With this information, 
the spacewalker can focus its attention on costly components 
that are underutilized. 

These and other internal metrics provide more detailed
information about internal workings of a previous candidate
processor. Such internal metrics provide the spacewalker with
more precise information about what changes might be most
attractive with respect to achieving increased performance at
modest cost or achieving decreased cost with minimal loss
of performance. This allows for more efficient spacewalking
procedures.
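
The second metric in the list above lends itself to a short illustration. The Python sketch below assumes, hypothetically, that the operation issue statistics are available as pairs (set of opgroup indices issued in one instruction, execution count); the actual report format of the re-targetable compiler is not specified here. It proposes as exclusion candidates every pair of opgroups whose operations are never, or only rarely, issued together.

```python
from itertools import combinations


def propose_exclusions(issue_trace, opgroup_ids, threshold=0.0):
    """Suggest opgroup pairs that could be made mutually exclusive.

    issue_trace: iterable of (issued_opgroup_ids, execution_count) pairs.
    threshold:   largest fraction of executed instructions in which a
                 pair may co-issue and still be proposed for exclusion.
    """
    together = {pair: 0 for pair in combinations(sorted(opgroup_ids), 2)}
    total = 0
    for issued, count in issue_trace:
        total += count
        for pair in combinations(sorted(issued), 2):
            if pair in together:          # ignore unknown opgroup ids
                together[pair] += count
    if total == 0:
        return []
    return [pair for pair, n in together.items() if n / total <= threshold]
```

For a trace in which opgroup 1 co-issues with each of opgroups 2, 3 and 4 but those three never co-issue with one another, the function returns (2,3), (2,4) and (3,4), the exclusions used as the example in Section 2.2. Whether adding them actually lowers cost without hurting performance still has to be confirmed by re-synthesis and recompilation.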



3.3. Translating Internal Metric Feedback into
Processor Parameters

When internal metrics are used in spacewalking, it is 
important that there be a means to use these internal mea- 
surements to make appropriate changes in the processor 
specification. The spacewalker makes such changes in order 
to search for a processor that is superior to those which have 
already been discovered. These changes can be made to 
structural parameters (e.g., adding, deleting, or modifying a 
macrocell). These changes can also be made to non- 
structural parameters (e.g., adding, deleting, or modifying 
opgroups or their ILP constraints). When the spacewalker 
searches an abstract non-structural processor 
parameterization, it may not be clear how to relate internal 
measurements used for feedback to an appropriate change in 
the non-structural parameters used to select a new processor. 

The spacewalker may incorporate a number of capabili-
ties to help translate results gained from internal measure-
ments back into appropriate changes in structural and non-
structural processor parameters. A few of these capabilities
are outlined below:

3.3.1. Translation of Register File Port Utilization 
Back into Processor Parameters 

After the synthesis process, the relationship between each opgroup and
the requisite hardware support is known. The synthesis process can
provide to the spacewalker key information regarding the usage of
specific ports by specific opgroups. For each register file, the
synthesis process can provide the number of input ports and output
ports for that register file. Further, for each port, the synthesis
process can describe which opgroups utilize that port. With this
information, the spacewalker is better able to understand the
relationship between underutilized ports and either the addition of
exclusions between opgroups or the elimination of opgroups that might
lead to re-synthesis of hardware structures with fewer ports.

3.3.2. Translation of Macrocell Utilization Back
into Processor Parameters

For each macrocell used to construct the actual processor, the
synthesis process describes which opgroups make use of that macrocell
and the cost (e.g., in VLSI area) of that macrocell. The spacewalker
may use this information in order to add exclusions or eliminate
opgroups to re-synthesize a new processor where expensive or
underutilized macrocells have been eliminated.

3.3.3. Translation of Instruction Template
Information Back into Processor Parameters

Instructions are expensive to represent in memory and can be divided
into fields whose utilization can be measured. Measurements of presence
or absence of field usage can be used to assess field utilization.
Here, a field is underutilized when it takes on a value which indicates
that it does not participate in the computation; for example, when an
operation field executes many NOOPs or a literal field usually provides
an unused literal, these fields may be considered as underutilized.
Measurements of information content within a field can also measure
field utilization. Here, a field is underutilized when it statistically
takes on only a small fraction of the values that it might potentially
represent; for example, a wide opcode field which almost always holds
an add operation, or a wide literal field which almost always holds the
constant "3", might both be considered as underutilized.

Such measurements can be used by the spacewalker in order to search for
better processors. However, the spacewalker needs a means to translate
information about physical fields into abstract non-structural
processor parameters. For example, synthesis can describe, for each
field, exactly which opgroups make use of that field. This allows the
elimination of opgroups or the placement of exclusions among the
opgroups responsible for wide and expensive fields.

4.0 Spacewalking Procedures

The spacewalking procedure is responsible for identifying a set of
candidate processor designs. Each candidate processor design must be
provided as a parameter to a subsequent synthesis procedure. The
processor is described using an appropriate data structure which
represents either a hardware structural description or an abstract
non-structural description of the processor.

After a candidate processor is identified and represented, the
synthesis procedure is invoked, thus constructing the actual processor
suitable for executing the given application. Also generated, if
necessary, are any programs, data tables, or other companion
information that the processor might require to execute the
application. With this detailed description of the processor, the
automatic design system is able to accurately determine the cost of the
candidate design, the performance of the design on the given
application, and any internal metrics which are used for spacewalking
feedback.

At this time, the quality of the newly synthesized processor can be
established relative to previously explored designs. If the spacewalker
determines that further design exploration may be fruitful, it may use
any information collected for this design, as well as similar
information collected for previous candidate designs, in order to
select one or more new candidate processors.

Using most processor parameterizations, the number of potential
processors that can execute a given application is enormous. The role
of the spacewalker is to identify processors which deliver the greatest
possible performance at the lowest possible cost. An efficient
spacewalker is one that walks a small subset of the total number of
processors and yet identifies processors which are particularly
efficient.

A processor P is defined to be Pareto optimal if there is no other
processor Q which satisfies either 1) or 2): 1) Q is less costly and
has the same or greater performance than P; or 2) Q has greater
performance than P and Q is less or equal in cost to P. Each of these
Pareto processors is not eclipsed by a strictly better processor. Thus,
Pareto optimal processors are considered excellent candidates for
design selection. An efficient spacewalker systematically identifies
processors that are Pareto optimal (or close to Pareto optimal) while
inspecting only a very small fraction of the total number of processors
that can be represented using the processor parameterization.

A number of interesting strategies can be used to craft an efficient
spacewalker. One strategy begins with low-cost systems and considers
systems of increasingly greater cost. Here, the spacewalker may take a
previously explored system, which appears attractive, and may add
functionality to that processor. In particular, functionality should be
added which may produce a large performance increase; or functionality
should be added which costs very little; or functionality should be
added which jointly provides a performance increase and costs little.
The internal statistics mentioned above are very helpful in identifying
such changes. These changes are typically reflected by adding opgroups
or modifying ILP constraints (e.g., removing exclusions or adding
concurrency sets) to enhance performance. The spacewalker is often
guessing regarding the impact of a specific change in either cost or
performance. If results are not satisfactory, the proposed change may
be treated as a dead end in the exploration process.

Another strategy begins with a higher cost system and attempts to
identify very attractive lower cost processors. Here, the spacewalker
should subtract functionality which will have very little negative
effect on performance; or the spacewalker may subtract very expensive
functionality; or the spacewalker may subtract functionality which is
jointly not very detrimental to performance as well as very expensive.
These changes are typically reflected by either removing operation
groups in the abstract processor specification or by introducing
exclusions among operation groups to decrease cost.

These spacewalking strategies may be mixed, and spacewalking may move
both upward in cost and functionality as well as downward in cost and
functionality. Such mixed strategies require that some scheme be put in
place to preclude endless cycles where spacewalkers continue to
re-explore processors that have been previously investigated.

5.0 Spacewalking EPIC Architectures

EPIC architectures support a number of advanced features that provide
additional performance or efficiency when
compared to prior non-EPIC architectures. Automated design techniques
including spacewalking can be used in conjunction with EPIC
architectures. In order to incorporate these features in the design
space, the processor parameters include parameters indicating whether a
candidate processor supports the feature, and in some cases, additional
parameters specifying how the feature is supported.

A summary of EPIC features that may be explored in a programmatic
search of the design space follows below.

5.1 Support for Control Speculation

In control speculation, specialized hardware in the processor uses
tagged operands to track erroneous or exceptional results that were
generated by a speculative operation. Such erroneous results are
reported or processed by an exception handler when such an erroneous
result is used non-speculatively.

In the VLIW synthesis process discussed in Section 6, speculation is
specified independently for the hardware and compiler. The hardware
options are [none, conventional, tagged]; the compiler options are
[none, restricted, general]. The meaningful [hardware, software]
combinations are [conventional, none], [conventional, restricted],
[none, general], and [tagged, general].

5.2 Support for Predication

Operations read an additional guarding predicate operand, typically a
single bit stored in a predicate register file. These operations either
execute or are nullified according to the value of the guarding
predicate. Other compare operations compute predicates for later use as
guards. Predicated execution can be used to eliminate branches or to
generalize the laws of compile time code motion, thus producing more
efficient static schedules.

The support for predication is specified in a processor parameter. The
choices are: supported by both hardware and software, or by neither. In
addition, the register file specification may include a predicate
register file type and the number of registers in each file of this
type.

5.3 Support for Data Speculation

In traditional architectures, when a load appears after a store in a
program and the load potentially aliases with that store, the load must
be held after the store in the program schedule. This ensures that if
they have a common memory address, the value stored by the store
operation is subsequently loaded by the load operation. In many cases,
these addresses are never or are rarely the same. However, the compiler
must conservatively generate a sequential schedule because it has been
unable to prove that they do not alias. Data speculation replaces the
conventional load with two operations, the data-speculative load and
the data-verifying load. The data-speculative load may be moved above
prior potentially aliasing stores, thus allowing more efficient program
schedules. The data-verifying load appears after potentially aliasing
stores but executes with very low latency. When no store address
aliases with the data-speculative load, the data-verifying load does
nothing and program execution continues. When hardware detects that a
data-speculative load is followed by an aliasing store (presumably a
rare event), a subsequent data-verifying load stalls the processor and
executes the load in order to ensure that the stored data is properly
loaded before execution resumes.

A candidate processor's input specification indicates the presence or
absence of data speculation through a boolean variable. This variable
impacts how the compiler schedules the application code and also
instructs the VLIW design system to select hardware macrocells that
support data speculation, where appropriate.

5.4 Rotating Registers and Specialized Branch
Instructions

Software pipelining is a compile-time scheduling technique that
overlaps the execution of consecutive loop iterations in order to speed
up execution. Rotating registers and specialized branch instructions
execute software pipelines with maximal efficiency. Rotating registers
eliminate the need for code unrolling which would otherwise be
associated with software pipelines. A value generated into some
register while executing loop iteration i can overlap a subsequent
value generated by the same operation (and referencing the same result
register) at loop iteration i+1. Register rotation causes the reference
to the same target register during the i+1 iteration to reference a
distinct physical register. A specialized branch instruction controls
the rotation of the register files in response to executing new loop
iterations.

A candidate processor's input specification indicates the presence or
absence of rotating registers through a boolean variable. This variable
impacts how the compiler schedules the application code and also
instructs the VLIW design system to select hardware compatible with
rotating registers.

6.0 VLIW Synthesis

6.1 Introduction

[...] illustrating the design flow in a VLIW design system. The system
is implemented in a collection of program modules written in the C++
programming language. While the system may be ported to a variety of
computer architectures, the current implementation executes on a
PA-RISC workstation or server running under the HP-UX 10.20 operating
system. The system and its components and functions are sometimes
referred to as being "programmatic." The term "programmatic" refers to
a process that is performed by a program implemented in software
executed on a computer, in hardwired circuits, or a combination of
software and hardware. In the current implementation, the programs as
well as the input and output data structures are implemented in
software stored on the workstation's memory system. The programs and
data structures may be implemented using standard programming
languages, and ported to a variety of computer systems having differing
processor and memory architectures. In general, these memory
architectures are referred to as computer readable media.

Before outlining the design flow, it is helpful to begin by defining
terms used throughout the description.

6.2 Definitions

VLIW

VLIW refers to very long instruction word processors. In the context of
this document, the term also refers more generally to an explicitly
parallel instruction computing (EPIC) architecture that explicitly
encodes multiple independent operations within each instruction.

Operation Set

An operation set is a set of opcodes that are mutually exclusive or
cannot be issued concurrently. The ability to represent opcodes in an
operation set is only a convenience and is not required to implement
the system. While each
operation set can consist of a single opcode, it is more convenient to
specify opcodes with similar properties as a set. This approach
simplifies the input specification because the user (or another program
module) need only specify desired concurrency and/or exclusion
relationships among sets of operations, as opposed to each individual
operation. Though not required, the opcodes in an operation set may
share similar properties, such as latency and data type. For example,
integer arithmetic operations such as ADD and SUB might be organized in
an operation set. In the description that follows, we use the notation
ops ( ) to represent an operation set in textual form.

Operation Group

An operation group is an instance of an operation set. Operation groups
make it possible to specify that multiple instances of the same
operation be issued concurrently. For example, one may want a processor
to be able to execute three integer ADD operations concurrently. Thus,
the designer could specify that the input specification will include
three operation groups, A, B, C, each representing an instance of the
operation set, ops (ADD SUB).

Operation Group Occurrence

An operation group occurrence is an occurrence of an operation group in
a particular concurrency set or exclusion set. The operation group
occurrence enables the processor designer to identify concurrency or
exclusion relationships among operation groups explicitly in the input
specification. For example, consider an operation group A that is an
instance of the operation set ops (ADD SUB). This operation group may
be issued concurrently with many different combinations of other
operation groups. In order to specify these concurrency relationships,
the input specification allows a different occurrence (e.g., A1, A2,
etc.) of the same operation group to be a member of each of these
concurrency sets.

Concurrency Set

A concurrency set is a set of operation group occurrences that may be
issued concurrently.

Exclusion Set

An exclusion set is a set of operation group occurrences that are
mutually disjoint. In other words, the exclusion set specifies a set of
operation groups, each having operations that cannot be executed
concurrently with any of the operations in each of the other groups in
the exclusion set. When specifying ILP constraints in terms of an
exclusion set, the exclusion sets may be expressed as a set of
operation groups or operation group occurrences.

Abstract Instruction Set Architecture Specification

An Abstract Instruction Set Architecture (ISA) Specification is an
abstract specification of a processor design and may include the
following:

an opcode repertoire, possibly structured as operation sets;

a specification of the I/O format for each opcode;

a register file specification, including register files and specifying
their types and the number of registers in each file;

In our implementation, the register file specification includes the
following:

1. Register file types, e.g., integer, floating-point, predicate,
branch, etc.

2. The number of register files of each type.

3. The number of registers in each file. Registers are divided into
static and rotating. Thus, it specifies the number of static registers
and number of rotating registers.

4. The bit-width of registers in a file.

5. Presence or absence of speculative tag bit.

a specification of the desired ILP constraints, making use of some form
of concurrency sets, exclusion sets or a combination of concurrency and
exclusion sets, that specifies which sets of operation groups/opcodes
can be issued concurrently; and

other optional architecture parameters, e.g., presence/absence of
predication, speculation, etc.

There are a variety of ways to represent the ILP constraints. The user
(or another program module) may specify the desired ILP by specifying
exclusion and concurrency relationships among operation group
occurrences. One way to specify exclusion and concurrency relationships
is to construct a data structure representing AND-OR relationships
among operation group instances, such as a multi-level AND-OR tree. In
such a structure, an AND relationship represents a concurrency
relationship among operation group occurrences. Conversely, an OR
relationship represents an exclusion relationship among operation group
occurrences. Another way to specify exclusion and concurrency
relationships is through a graph data structure where the nodes
represent operation group occurrences, for example, and the edges
connecting the nodes represent exclusion or concurrency relationships
among the nodes. Yet another way is to specify pairwise exclusions
between operation group occurrences. It is important to note that our
approach of organizing operations into operation sets, operation
groups, and operation group occurrences is just one way to facilitate
expression of ILP constraints. Other ways to organize operations and to
express ILP constraints among these operations may be used as well.

ArchSpec

The ArchSpec is a textual, external file format for the Abstract ISA
specification. The ArchSpec may be converted to an abstract ISA spec
data structure, which is then processed further to synthesize a
processor design. While the specific format of the ArchSpec is a
textual file, it is not critical that the input be specified in this
form. For example, the input could be specified via a graphical user
interface and converted into an abstract ISA data structure.

Instruction Format Specification

The instruction format specifies the instructions capable of being
executed in a VLIW processor design. These instructions are represented
as instruction templates in the current implementation. The instruction
format also includes the instruction fields within each template, and
the bit positions and encodings for the instruction fields.

Concrete ISA Specification

The concrete ISA specification includes the instruction format
specification and a register file specification of a processor design.

Register File Specification

A register file specification of a processor includes register files,
the types of these register files, and the number of registers in each
file. It also includes a correspondence between each operand
instruction field type and a register file.

As explained above, the register file specification may be provided as
part of the abstract ISA.

The register file specification may also be taken from the data path
specification, or in some applications, it may be taken from a concrete
ISA specification.

Macrocell Library

A macrocell library is a collection of hardware components specified in
a hardware description language. It includes components such as gates,
multiplexors (MUXes), registers, etc. It also includes higher level
components such as ALUs, multipliers, register files, instruction
sequencers, etc. Finally, it includes associated information used for
synthesizing hardware components, such as a pointer to synthesizable
VHDL/Verilog code corresponding to the component, and information for
extracting a machine description (MDES) from the functional unit
components.

In the current implementation, the components reside in a 
macrocell database in the form of Architecture Intermediate 
Representation (AIR) stubs. During the design process, 
various control path design program modules instantiate 
hardware components from the AIR stubs in the database. 
The MDES and the corresponding information in the functional
unit component (called mini-MDES) are expressed in a
database language called HMDES Version 2 that
organizes information into a set of interrelated tables called
sections containing rows of records called entries, each of
which contains zero or more columns of property values
called fields. For more information on this language, see
John C. Gyllenhaal, Wen-mei W. Hwu, and B. Ramakrishna
Rau. HMDES version 2.0 specification. Technical Report
IMPACT-96-3, University of Illinois at Urbana-Champaign,
1996.

Architecture Intermediate Representation 

The Architecture Intermediate Representation (AIR) is a
hardware description representation in a machine-readable
form. The form of AIR used in the automated control path
design is similar to VHDL, but is implemented in a computer
language that makes hardware components described in AIR
format easier to manipulate with the program routines.

AIR provides a number of C++ classes that represent
hardware components such as registers, ports and wiring. An
AIR design consists of objects instantiated from these
classes. For example, an AIR representation of the control
path may include a number of macrocell objects representing
hardware components such as a register, a FIFO buffer, a
multiplexor, a tri-state buffer, and wiring. Each of the
macrocells may have a number of control and data ports in
AIR format and may be interconnected via an AIR wiring
data structure.
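
As a rough illustration of the object model described above, the following Python sketch mimics macrocell objects with ports joined by a wiring structure. It is not the AIR class library itself, which is written in C++; the class names Port, Macrocell and Wiring, the port names, and the two example components are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Port:
    name: str
    kind: str  # "data" or "control"


@dataclass
class Macrocell:
    """Hypothetical stand-in for a hardware component object."""
    name: str
    ports: dict = field(default_factory=dict)

    def add_port(self, name: str, kind: str) -> Port:
        self.ports[name] = Port(name, kind)
        return self.ports[name]


@dataclass
class Wiring:
    """Hypothetical wiring structure: a list of port-to-port connections."""
    connections: list = field(default_factory=list)

    def connect(self, src: Macrocell, src_port: str,
                dst: Macrocell, dst_port: str) -> None:
        # Look both ports up first so a misspelled port name raises KeyError.
        a, b = src.ports[src_port], dst.ports[dst_port]
        self.connections.append(((src.name, a.name), (dst.name, b.name)))


# Two example components of a control path (names are made up).
prefetch = Macrocell("prefetch_fifo")
prefetch.add_port("data_out", "data")
ireg = Macrocell("instruction_register")
ireg.add_port("data_in", "data")
ireg.add_port("load", "control")

wiring = Wiring()
wiring.connect(prefetch, "data_out", ireg, "data_in")
```

In this sketch the control port "load" is left unconnected, in the same way that the data path specification leaves control ports to be wired up later by the control path design process.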
Data Path Specification 

The data path specification is a data structure specifying 
functional units, register files and interconnections between 
the data ports of the functional units and register files. The 
data path also specifies control ports, such as the opcode 
inputs of the functional units and the register file address
inputs. However, the task of connecting these control ports 
to the decode logic in the control path is left to the control 
path design process. 

In the implementation, the data path specification is a set 
of related object instantiations in AIR format, enumerating 
the macrocell instances of functional units and their inter- 
connect components, such as multiplexors, tri-state buffers, 
buses, etc. 
Instruction Unit 

The instruction unit includes a control path and an instruction
sequencer. The control path has three principal components:
1) the data path of an instruction from the instruction
cache to the instruction register (IR) (the IUdatapath), 2) the
control logic for controlling the IUdatapath, and 3) the
instruction decode logic for decoding each instruction.

In the current implementation, the IUdatapath starts at the
instruction cache and ends at an instruction register that
interfaces with the instruction decode logic. It includes
instruction prefetch buffers and an instruction alignment
network for aligning the instruction in the instruction
register. Connected between the sequencer and IUdatapath, the
IU control logic is combinational logic used to control the
instruction prefetch buffers and the alignment network.
The control logic also provides information to the instruction
sequencer that is used to initiate the fetching of the next
instruction from the ICache. For example, in the current
implementation, the control logic processes information
from the instruction that specifies the width of the current
instruction and indicates whether the next instruction is
aligned to a known address boundary (e.g., an instruction
packet boundary). The width of the current instruction is
derived from an instruction identifier called the template ID.
The packet boundary information is specified in the
instruction as a consume-to-end-of-packet bit indicating whether
the next instruction directly follows the current instruction
or starts at the next packet boundary. This bit is used to align
certain instructions (e.g., branch targets) to known address
boundaries. The instruction may also include spare bits that
encode the number of no-op cycles to follow the current
instruction.
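
The address arithmetic implied by the template ID and the consume-to-end-of-packet bit can be sketched as follows. This is only an illustration under stated assumptions: the width table, the packet size, and the function name next_instruction_address are hypothetical and are not taken from the implementation.

```python
# Hypothetical instruction widths in bytes, keyed by template ID.
TEMPLATE_WIDTH = {0: 4, 1: 8, 2: 16}

# Assumed instruction packet size in bytes.
PACKET_SIZE = 16


def next_instruction_address(addr, template_id, consume_to_end_of_packet):
    """Return the address of the instruction that follows the one at addr."""
    # The width of the current instruction is derived from its template ID.
    end = addr + TEMPLATE_WIDTH[template_id]
    if consume_to_end_of_packet:
        # Skip the rest of the packet: round up to the next packet boundary
        # (an address already on a boundary is left unchanged).
        return -(-end // PACKET_SIZE) * PACKET_SIZE
    # Otherwise the next instruction directly follows the current one.
    return end
```

With these assumed values, an 8-byte instruction at address 4 with the bit set is followed by an instruction at address 16, whereas with the bit clear the next instruction would start at address 12.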
Instruction Sequencer 

The instruction sequencer is the control logic that interfaces
with the control logic of the IUdatapath and specifies
the sequence of instructions to be fetched from the instruction
cache. It manages a memory address register (MAR)
that holds the memory address of the next instruction to be
fetched from the instruction cache, and the Program
Counter, identifying the next instruction to be executed in
the processor. The control ports of the sequencer interface
with the control ports of the IUdatapath control logic. The
sequencer is also responsible for interfacing with the branch
functional unit and for managing events such as interrupts
and exceptions. The sequencer is a generic macrocell.
Control Path Protocol 

The control path protocol provides a structural and procedural
model of the control path. The structural model
identifies the types of macrocells used to construct the
control path, as well as the parameters of these components.
Examples of these components may include a prefetch
buffer that covers the latency of sequential instruction fetch,
an instruction register for storing the next instruction to be
issued to the decode logic, and an alignment network made
of multiplexors for aligning the next instruction to be issued
in the processor.

The procedural model generally specifies the method for
fetching instructions from an instruction cache and issuing
them to the control ports in the data path.

The automated design system described below is programmed
to design a specific instance of a control path
based on a predefined control path protocol. The design
process includes assigning specific values to the control path
parameters in the structural model, and converting the
procedural model into a specification of the control logic for
controlling instruction fetching.
Basic Block 

A basic block is a sequence of program statements in a
computer program. The flow of control enters the basic
block only through the top of the basic block, in a
conditional branch, and exits only at the bottom. A related
term is a superblock (also referred to as a hyperblock). In a
superblock, the flow of control enters only at the top, and
may have one or more exits from the side or bottom of the
block.

Basic and super blocks are useful in simulating the
performance of a computer program. For example, code
simulators can simulate the program, before abstract
operations in the program are mapped to specific processor
resources (e.g., functional units and registers), and provide
statistics indicating how many times each basic block in the
program will be (or is likely to be) visited during execution
of the program. Superblocks can be used in a similar manner,
except that additional information is required to indicate
how many times a superblock is exited and from which
exits.

6.3 Outline of the Design Flow

As summarized above, the VLIW design system may be
used in a variety of different design scenarios. In some
scenarios, the VLIW developer may use the system to
generate a portion or all of the VLIW processor design
programmatically. In others, the developer may use the
system to optimize an existing VLIW design or part of an
existing VLIW processor design. As such, there are many
different possible starting and stopping points in the design
flow.

FIG. 1 depicts data structures as boxes and programmatic
processes as ovals. The data structures represent starting
points, stopping points, and, in some cases, both potential
starting and stopping points of VLIW design processes. The
data structures may be provided in an external form, such as
a text file, suitable for input by or output to the user. In
addition, the data structures may be provided in an internal
form, meaning that they are primarily accessed and
manipulated by program routines. Whether in external or
internal form, the data structures are "computer-readable" in
the sense that the program modules in the system may read
or write these data structures.

Below, we outline the design flow from an abstract
specification of the processor to a complete description of
the processor in a hardware description language. In
addition, we cite some alternative design scenarios. A number
of design scenarios are possible and should be apparent
from the detailed description of the system's implementation.

The design flow shown in FIG. 1 begins with an abstract
ISA specification 20 and a macrocell database 22. The
datapath design process 24 programmatically generates a
datapath in a hardware description language by reading the
abstract ISA specification and building the datapath 26 using
instances of register files, functional units and interconnect
components from the macrocell database.

The system may then create an instruction format
specification 28 based on the data path 26 and abstract ISA spec
20. The IF design process includes two primary components:
1) a setup bit allocation process 30; and 2) a bit allocation
process 32. The first component sets up a bit allocation
problem specification 34, which is used by the second
component. The setup process initially selects instruction
templates based on the ILP specification. The instruction
templates each specify operation group occurrences that
may be issued concurrently. It then builds the IF-tree data
structure containing instruction fields, corresponding to
various datapath control points, that need to be assigned bit
positions within each instruction template. To set up the bit
allocation problem, the process extracts instruction field bit
requirements 36. It then identifies instruction field conflict
relationships 38, specifying which fields must not share bit
positions in the instruction format. Finally, it partitions
instruction fields into groups based on instruction field to
control port mappings. Instruction fields that map to the
same control port are grouped together in a "superfield."
Instruction fields in a group may share bit positions but are
not forced to share. The user may additionally specify that
certain instruction fields in a group must share bit positions.
This process creates "preferred" and "must" superfield
partitionings 40.

The bit allocation process 32 operates on the bit allocation
problem specification 34 to allocate bit positions to each of
the instruction fields of the instruction templates. The output
of this process is the instruction format specification 28.

Using the instruction format and datapath specifications
28, 26 and selected instruction cache parameters 42, a
control path design process 44 generates a control path
specification. In the implementation shown in FIG. 1, the
system generates a hardware description of the processor's
control path by selecting instances of hardware components
from the macrocell database 22. The system designs the
control path based on a predefined, parameterized control
path protocol. The protocol specifies the general approach
for fetching instructions from an instruction cache, buffering
these instructions, and aligning them in an instruction
register. The control path design process augments the hardware
description of the datapath by specifying instances of the
control path hardware and specifying how these hardware
components are connected to the control ports in the
datapath.

At this stage in the design flow, the processor design
includes the instruction format specification and a hardware
description of the datapath and control path. In some
scenarios, the processor design may be optimized further.
One form of optimization used in the system is the
customization or optimization of the processor design based on
internal usage statistics, such as operation issue statistics 48.

The system shown in FIG. 1 includes software modules
for extracting a machine description called MDES at varying
stages of the VLIW design flow. These modules are labeled
generally as MDES extraction 50. The MDES extraction
process programmatically generates an MDES description
52 for driving a retargetable compiler 54. The MDES 50 is
represented in database tables that provide the opcode
repertoire of the processor, their IO formats, their latencies
and resource sharing constraints of the operations. For each
operation, the resource sharing constraints specify the times
at which the operation uses certain resources (e.g., register
file ports, data path interconnect buses, etc.). The
retargetable compiler queries the database tables to obtain the
constraints used to schedule an application program 56. The
retargetable compiler 54 provides a schedule of the program
from which the operation issue statistics 48 may be
gathered. These statistics indicate the combinations of the
operations that are issued concurrently as well as their frequency
of issuance.

In a process called "custom template selection" 56, the
system uses the operation issue statistics to select custom
instruction templates 58. In general, the custom templates
specify operation group occurrences and their ILP
constraints. The system optimizes the instruction format for the
application program 56 by using the custom templates as
well as the ILP constraints in the abstract ISA specification
to set up the bit allocation problem specification.

Since the MDES may be extracted at various points in the
design flow, the custom templates based on this MDES may
be used at various points in the design process as well. In
particular, the system may perform MDES extraction based
solely on the abstract ISA specification, based on a
combination of the abstract ISA specification and the datapath, and
finally, based on a combination of the abstract ISA, the
datapath and the control path. As the system specifies
additional hardware for the processor, such as the datapath
and control path hardware, it can augment the resource
reservation tables used to retarget the compiler.

In some design scenarios, the system may not start with
an abstract ISA specification. Instead, it may derive it from
an existing datapath specification 26 or concrete ISA
specification 60. In the first case, the datapath specification may
have been specified by hand, or may have been generated in
a previous design pass through the VLIW design flow. In the
second case, the concrete ISA specification may be provided
as input based on some existing processor design. For
example, the developer may want to create the next
generation of a processor based on the concrete ISA specification
for the current generation. Alternatively, the developer may
wish to optimize an existing concrete ISA specification for
a particular application or application program.

To support these design scenarios, the system includes
modules 62, 64 for extracting an abstract ISA specification
from an existing datapath and concrete ISA specification,
respectively. Once the abstract ISA specification is
extracted, the system or the user may alter the abstract ISA
specification before using it as input to the VLIW design
flow. One particular example is the use of custom templates
based on operation issue statistics. Many other scenarios are
possible. For example, the system or user may alter the
opcode repertoire and ILP constraints to achieve an
optimized design based on cost/performance trade-offs.

6.3.1 Non-Programmable Processors

The types of program modules and data structures used in
the design flow will vary for the design of programmable
and non-programmable processors. In the context of the
VLIW design flow, the design of programmable and
non-programmable processors differs in the way the control logic
is designed. The system illustrated in FIG. 1 may be adapted
to design processors having the following types of control
logic:

1. Program Counter based; and
2. Finite state machine.

Each of these forms of control is described in the
background section. In the first case, the design flow generates
the control path based on parameterized control path and
control path protocol, based in part on an instruction format
specification. The design flow may generate the instruction
format programmatically, or it may be specified by hand.

In the second case, the control logic is in the form of
hard-wired logic, such as a finite state machine. To design a
processor that employs this form of control logic, the design
flow may be adapted as follows. First, the components and
data structures used to design the instruction format are
unnecessary (processes 30, 32, 56, and 64; and data
structures 34, 48, 58, and 60). Next, the control path design would
be replaced with an NPA logic design process and the output
of this process would be a hardware description of the finite
state machine. As before, an MDES would be extracted from
the datapath and used to retarget the compiler. The
retargetable compiler would then be used to generate a scheduled
program, which in turn, would be provided as input to the
NPA logic design process.

In each of the three approaches, the hardware technology
used to implement the processor, including the datapath and
control logic, may be any of a variety of hardware
methodologies such as FPGA, custom logic design, gate arrays, etc.

6.4 Implementation of the Abstract ISA

6.4.1 ArchSpec

The ArchSpec is an abstract textual description of a
specific VLIW machine drawn from a generic architecture
family, such as the HPL-PD family of architectures. (See
Vinod Kathail, Michael Schlansker, Bantwal Ramakrishna
Rau. HPL PlayDoh Architecture Specification: Version 1.0.
Technical Report HPL-93-80. Hewlett-Packard
Laboratories, Feb. 1994.) In the context of this document,
the term VLIW is construed broadly to encompass Explicitly
Parallel Instruction Computing (EPIC) Architectures. The
architecture family specifies a superset of opcodes (e.g., the
HPL-PD family instruction set), a set of logical register files
to store various types of operands, and a specification of
which logical files each opcode can source/sink its operands
from/to, that is, its (logical) operation format. The specification
further specifies the semantics of important architectural
mechanisms that may be included or excluded, such as
predication, speculation, support for modulo-scheduling etc.

At an abstract level, the ArchSpec need only specify the
functionality of the hardware implementation in terms of its
opcode repertoire and the desired performance level. In
general, the ArchSpec enumerates the set of opcode
instances that are to be implemented by the target machine,
and provides a description of the amount of ILP that is to
exist among them.

For convenience, the various instances of the opcodes for
a given machine are grouped into Operation Groups, each of
which is a set of opcode instances that are similar in nature
in terms of their latency and connectivity to physical register
files and are to be mutually exclusive with respect to
operation issue. For example, since add and subtract
operations require similar operand types and execute on the same
ALU, their respective opcode instances may be placed in the
same operation group. By definition, all opcode instances
within an operation group are mutually exclusive, while
those across operation groups are allowed to execute in
parallel.

The parallelism of the machine may be further
constrained by placing two or more operation groups into a form
of an exclusion set called an Exclusion Group, which makes
all their opcode instances mutually exclusive and allows
them to share resources. For instance, [...] may include
multiply and [...].

As an example, a simple 2-issue machine is specified
below. This example specification is expressed in a database
language called HMDES Version 2. See John C. Gyllenhaal,
Wen-mei W. Hwu, and Bantwal Ramakrishna Rau. HMDES
version 2.0 specification. Technical Report IMPACT-96-3,
University of Illinois at Urbana-Champaign, 1996. This
language organizes the information into a set of interrelated
tables called sections containing rows of records called
entries. Each entry contains zero or more columns of
property values called fields.

SECTION Operation_Group {
  OG_alu_0 (ops (ADD SUB) format (OF_intarith2));
  OG_alu_1 (ops (ADD SUB) format (OF_intarith2));
  OG_move_0 (ops (MOVE) format (OF_intarith1));
  OG_move_1 (ops (MOVE) format (OF_intarith1));
  OG_mult_0 (ops (MPY) format (OF_intarith2));
  OG_shift_1 (ops (SHL SHR) format (OF_intarith1));
}

SECTION Exclusion_Group {
  EG_0 (opgroups (OG_alu_0 OG_move_0 OG_mult_0));
  EG_1 (opgroups (OG_alu_1 OG_move_1 OG_shift_1));
}

This example specifies two ALU operation groups
(OG_alu_0, OG_alu_1), two move operation groups
(OG_move_0, OG_move_1), one multiply group (OG_mult_0),
and one shift group (OG_shift_1). These operation
groups are further classified into two exclusion groups
(EG_0, EG_1) consistent with a two-issue machine. The
multiply group shares resources with one ALU group, while
the shift group shares resources with the other. Each
operation group also specifies one or more operation formats
shared by all the opcodes within the group. Additional
operation properties such as latency and resource usage may
also be specified, as shown below.

SECTION Operation_Group {
  OG_alu_0 (ops(ADD SUB) format("OF_intarith2")
    latency(OL_int)
    resv(RT_OG_alu_1)
    alt_priority(0));
  ...
}

The "resv" parameter provides an abstraction for specifying
user-defined sharing. The "alt_priority" parameter
provides the priority of the operation group in the MDES,
which the retargetable compiler uses to schedule the
operations. There is a similar set of parameters for each operation
group.
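
The mutual-exclusion rules stated for operation groups and exclusion groups can be checked mechanically. The following Python sketch is an illustration only: it copies the two exclusion groups of the two-issue example into a dictionary, and the function name may_issue_together is hypothetical.

```python
# Exclusion groups of the two-issue example, written as Python sets.
EXCLUSION_GROUPS = {
    "EG_0": {"OG_alu_0", "OG_move_0", "OG_mult_0"},
    "EG_1": {"OG_alu_1", "OG_move_1", "OG_shift_1"},
}


def may_issue_together(og_a, og_b):
    """Return True if operations from the two groups may issue concurrently."""
    if og_a == og_b:
        # Opcode instances within one operation group are mutually exclusive.
        return False
    # Operation groups placed in the same exclusion group are also exclusive.
    return not any(og_a in members and og_b in members
                   for members in EXCLUSION_GROUPS.values())
```

Under this reading, OG_alu_0 and OG_mult_0 cannot issue in the same cycle because both belong to EG_0, while OG_alu_0 and OG_shift_1 can, which matches a machine that issues at most one operation from each exclusion group per cycle.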

The ArchSpec additionally includes information to 
describe the physical register files of the machine and the 
desired connectivity of the operations to those files. A 
Register File entry defines a physical register file of the 
machine and identifies its width in bits, the registers it 
contains, and a virtual file specifier corresponding to the type 
of data (operands) it is used to carry. The virtual specifier
assumes an implied connectivity between the opcodes and
the register file, e.g., a floating point opcode would need to
connect to a floating point-type register file, etc. As an
alternative to implied connectivity, the user may specify an
explicit connectivity by specifying a mapping between each
operation and the type of register file associated with it.

The register file entry may also specify additional prop- 
erties such as whether or not the file supports speculative 
execution, whether or not the file supports rotating registers, 
and if so, how many rotating registers it contains, and so on. 
The immediate literal field within the instruction format of 
an operation is also considered to be a (pseudo) register file 
consisting of a number of "literal registers" that have a fixed 
value. 

The Operation Format (IO format) entries specify the set
of choices for source/sink locations for the various
operations in an operation group. Each operation format consists
of a list of Field Types (IO Sets) that determine the set of
physical register file choices for a particular operand. For
predicated operations, the input specification may also
specify a separate predicate input field type containing a
predicate register file.

The code listing below provides an example of the
register file and operation format inputs sections of an
ArchSpec:

SECTION Register_File {
  gpr(width(32) regs(r0 r1 . . . r31) virtual(I));
  pr(width(1) regs(p0 p1 . . . p15) virtual(P));
  lit(width(16) intrange(-32768 32767) virtual(L));
}

SECTION Field_Type {
  FT_I(regfile(gpr));
  FT_P(regfile(pr));
  FT_L(regfile(lit));
  FT_IL(compatible_with(FT_I FT_L));
}

SECTION Operation_Format {
  OF_intarith1(pred(FT_P) src(FT_I) dest(FT_I));
  OF_intarith2(pred(FT_P) src(FT_IL FT_I) dest(FT_I));
}

The example shows that the above machine has a 32-bit
general purpose register file "gpr", a 1-bit predicate register
file "pr" and a 16-bit literal (pseudo) register file "lit". Each
register file can be used alone or in conjunction with other
files in a field type specification as a source or sink of an
operand. The field types for the predicate, source and
destination operands are combined to form the valid
operation formats for each operation group. For example, the
2-input ALU operation group "OG_alu_0" (see "SECTION
Operation_Group" above) has an operation format
"OF_intarith2", which specifies that its predicate comes from the
predicate register file "pr", its left input is an integer from
either a literal register file or from a general purpose register
file "gpr", its right input is from "gpr" and its output is
written to the general purpose register file "gpr".
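
The way an operation format resolves to register file choices can be traced with a small lookup. The Python sketch below transcribes the Field_Type and Operation_Format sections of the example into dictionaries; the function name register_file_choices is hypothetical, and treating compatible_with as the union of the listed field types is an assumption made for this illustration.

```python
# Field types from the example, each mapped to its register file choices.
FIELD_TYPE = {
    "FT_I": ["gpr"],
    "FT_P": ["pr"],
    "FT_L": ["lit"],
}
# Assumption: FT_IL(compatible_with(FT_I FT_L)) is the union of FT_I and FT_L.
FIELD_TYPE["FT_IL"] = FIELD_TYPE["FT_I"] + FIELD_TYPE["FT_L"]

# Operation formats from the example: one field type per operand.
OPERATION_FORMAT = {
    "OF_intarith1": {"pred": ["FT_P"], "src": ["FT_I"], "dest": ["FT_I"]},
    "OF_intarith2": {"pred": ["FT_P"], "src": ["FT_IL", "FT_I"], "dest": ["FT_I"]},
}


def register_file_choices(op_format, role, index=0):
    """Return the register files that the given operand may use."""
    field_type = OPERATION_FORMAT[op_format][role][index]
    return FIELD_TYPE[field_type]
```

For OF_intarith2, the first source operand resolves to the files gpr and lit and the second to gpr only, which agrees with the description of the left and right inputs above.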

The specification may also contain information defining 
additional architecture parameters: 

SECTION Architecture_Flag {
  predication_hw(intvalue(1));
  speculation_hw(intvalue(0));
  systolic_hw(intvalue(1));
  technology_scale(doublevalue(0.35));
}

This section lists processor parameters indicating whether
the processor architecture supports predication, speculation,
and a systolic coprocessor. The last parameter is a
technology scale, specifying a desired manufacturing level (e.g., 0.35
micron). The technology scale can be used to calculate the
area of silicon required to manufacture the processor. For
instance, when the silicon area is a design constraint on
datapath synthesis, the synthesis process uses this
information to evaluate the cost (e.g., chip area) of a particular
design. The synthesis process may select functional units,
for example, that satisfy a constraint on the silicon area.

6.4.2 Converting the ArchSpec to Internal Form 

The system converts the ArchSpec into an internal form 
that is easier to manipulate and traverse programmatically. 
The program module called the reader 14 reads the Arch- 
Spec and generates the internal form of the abstract ISA 
specification. 

In the implementation, the internal form of the abstract 
ISA specification provides a normalized representation of 
the ArchSpec in terms of ILP constraints. In particular, both 
exclusion and concurrency sets are expressed in terms of 
opgroup occurrences. To generate this normalized 
representation, the reader extends the ILP constraints as 
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follows. For each opgroup occurrence in a coDcunreacy placed on the same functional unit. With these design 

group (if any), the reader gives the opgroup occurrence a guidelines, the datapath synthesizer attempts to assign a 

unique name. The reader then collects all opgroups occur- single functional unit to operation groups that are exclusive, 

rences of one opgroup into a new exclusion group. Next, it and places concurrent operations on different functional 

expands each opgroup in an cxchision group by its set of all s units, as shown in the physical datapath representation 238. 

op^oup occurrences. . , , The physical datapath 238 in FIG. 2 includes six func 

The reader also carries over aU other properties from the ^.^^^^ ^^^^ FU12, 

ArchSpec including register files, field types, operation ^^^^^^ ^j^^^ ^^^^ 

formats, and architecture nags. In the abstract ISA , ^ , °r i * . *l • » r ,l 

' . , ^ . . selects each functional unit to meet the requirements or the 

specification, each opgroup occurrence behaves like an m j • * u tafm^ \Mr\\i ottVt- * m * *u 

^ 1 « « J « opcode instances, such as lADD, MOV, SHFT, etc. Next, the 

opgroup. Therefore, the terms opgroup and opgroup ^ „ * • . L r • * ci m j n 

„ , , 1 . L i_ . process allocates the register ports of register files 10 and 11 

occurrence may be treated synonymously in the subsequent f * • _* * /iu j • * n 

discussion j j j ™i to satisfy the port requests of the opcode mstances. Fmally, 

^ .„ ' , , . the process creates the interconnect based on the port 

To Illustrate the relationship between the abstract input .a^^ion. He Unes leading into the register files are write 

and the corresponding datapath output HG. 2 graphicaUy 15 ^^jj^ ^^^l^ gi^^ 3^ 

depicts an example of an input specification 234 and a ^ ^ ^^^^ j^^j ^^^^^er and type of functional 

corresponding datapath design 238. The datapath synthesizer 124
(FIG. 1) processes an input specification like the one graphically
depicted in FIG. 2 (e.g., item 234) to create a physical datapath
representation 238, which is shown as a set of functional units
(FU00-FU12), register files (I0-I1), and the interconnect topology
between them. As shown in this example, the input specification 234
provides the desired operation groups 235, and specifies the
instruction level parallelism among these groups as exclusion groups
(e.g., IADD_11 and LAND_10 are related as an exclusion group 236).
Each operation group includes one or more opcode instances; for
simplicity, only a single opcode instance is shown for each operation
group. Each operation group typically contains opcodes that have
similar resource requirements and latency. Each exclusion group 236
comprises two or more operation groups (only two are shown here)
whose opcode instances are mutually exclusive, as illustrated by
exclusion marker 237 connecting the op groups together. For instance,
the opcode instances in operation group IADD_11 are mutually
exclusive with the opcode instances in operation group LAND_10. When
operation groups are marked as mutually exclusive, the datapath
synthesizer may force them to share processor resources by, for
example, assigning them to the same functional unit. When these
opgroups share a hardware resource, the compiler will not schedule
them to issue concurrently. If operation group occurrences are marked
as being concurrent (e.g., in a concurrency set), the datapath
synthesizer will synthesize the datapath so that these operations may
be issued concurrently. When the opgroup occurrences are specified as
part of a concurrency set, the compiler may schedule the
corresponding operations to issue concurrently.

The datapath synthesizer 124 (FIG. 1) converts the abstract input
specification into a machine-readable datapath representation. The
datapath representation is a set of related classes that define
instances of functional units, register files, etc., and their
interconnect, in the form of data buses, muxes, etc. The datapath
representation may then be processed to produce a hardware
description of the datapath in a hardware description language, e.g.,
a structural VHDL or Verilog description, which can be further
processed to produce a physical datapath, such as that shown in
FIG. 2.

A principal objective of datapath synthesis is to maximize processor
functionality and throughput without requiring excess duplication of
opcode instances and/or functional unit instances. Since operation
groups with no exclusion relationship can be issued concurrently, the
opcode instances within these operation groups must be placed on
separate functional units. Conversely, the opcode instances of
operation groups that are marked as mutually exclusive may be placed
on the same functional unit, as directed by the architecture
specification for a particular target machine.

The process of assigning functional units to opcode instances
directly follows the constraints specified in the input
specification. For example, the opcode instances of the IADD, MOV,
LAND, IMUL, and SHFT operation groups, which are not mutually
exclusive, are placed on separate functional units. The pairwise
exclusion relationship between the IMUL and SHFT operation groups
causes the synthesizer to place IMUL_00 and SHFT_00 on the same
functional unit, if possible. In general, the datapath representation
238 shows that the opcode instances of mutually exclusive pairs of
operation groups from the input specification 234 share functional
units.

The remaining components of the datapath 238, namely the register
files and their interconnect to the functional units, are synthesized
on the basis of the register file and operation format specification
present in the abstract ISA specification. For example, the operation
format for the IADD_01 and MOV_01 operation groups must specify that
their inputs are drawn from register file I0 and their output is
deposited in register file I1. Similarly, the operation format for
the IADD_10 and MOV_10 operation groups must specify that their
inputs are drawn from I1 and their outputs go to either I0 or I1.
This gives rise to the cross-connected functional units FU_00 and
FU_10.

An example of the textual description of these register files and
operation format specifications is provided below.

SECTION Field_Type {
  FT_I0(regfile(I0));
  FT_I1(regfile(I1));
}

SECTION Register_File {
  I0(width(32) regs(I0r0 . . . I0r31) virtual(I));
  I1(width(32) regs(I1r0 . . . I1r31) virtual(I));
}

SECTION Operation_Format {
  OF_intarith2_I0_I0 (pred() src(FT_I0 FT_I0) dest(FT_I0));
  OF_intarith2_I0_I0 (pred() src(FT_I0 FT_I0) dest(FT_I0 FT_I1));
}

6.5 Datapath Design

FIG. 3 is a flowchart of an implementation of the datapath synthesis
process shown in FIG. 1. The abstract ISA spec 218 is a
machine-readable data structure that specifies
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register files, operation groups, ILP constraints, and architecture
parameters. The datapath synthesis includes two primary phases:
1) synthesis of the functional unit instances (see steps 240, 242,
244, 246, and 248) and 2) register file and interconnect synthesis
(see steps 252, 254, 256, 258, 260, 262, 264, and 266).

6.5.1 Functional Unit Synthesis

In the synthesis of the functional units, the first step is to
determine the maximal sets of mutually-exclusive operations based on
the ILP constraints. In the current implementation, the datapath
synthesizer finds these sets of exclusive operations by setting up a
graph of the exclusion relations among operation groups and then
finding cliques in the graph. A clique is a well-known computer
science term for a maximal set of nodes in a graph, where each node
in the set connects with every other node in that set. In the context
of the exclusion graph, the cliques represent a maximal set of
operation group nodes where the operation groups are exclusive with
every other operation group in the set. The connections among the
nodes in the graph represent exclusion relationships between the
operation groups.

Exclusion cliques represent sets of operation groups that cannot be
executed concurrently. In the current implementation, the process of
finding cliques begins by generating a boolean exclusion matrix that
identifies the exclusion relationships between operation groups based
on the ILP constraints. FIG. 4 illustrates an example of an exclusion
matrix corresponding to the abstract specification 234 from FIG. 2.
The exclusion matrix for a given set of N operation groups will
comprise an NxN matrix, where the rows and columns are both labeled
with the same operation group identifier 39. Operation groups that
are mutually exclusive are then marked with a "1", while all other
values are "0" (not shown here for clarity). By default, all of the
values along the diagonal of the matrix are set to 1s, since an
operation group is assumed to be mutually exclusive with itself. The
exclusion matrix values will always mirror about the diagonal, so
that only one half of the matrix is actually needed for processing.

It is possible to reduce the size of the problem by collapsing nodes
that are equivalent in terms of exclusion/concurrency relations.

After building the exclusion matrix, the datapath synthesizer
executes a recursive algorithm on the matrix data to find the
exclusion cliques. The exclusion graph naturally follows from the
exclusion relationship expressed in the matrix. The recursive
algorithm operates on this graph according to the following
pseudocode:

RecursiveFindCliques(currentClique, candidateNodes)
1:  // Check if any candidate remains
2:  if (candidateNodes is empty) then
3:    // Check if the current set of clique nodes is maximal
4:    if (currentClique is maximal) then
5:      Record(currentClique);
6:    endif
7:  else
8:    startNodes = Copy(candidateNodes);
9:    while (startNodes is not empty) do
10:     H1: if (currentClique ∪ candidateNodes ⊆ some previous clique) break;
11:     node = pop(startNodes);
12:     candidateNodes = candidateNodes - {node};
13:     if (currentClique ∪ {node} is not complete) continue;
14:     H2: prunedNodes = candidateNodes ∩ NeighborsOf(node);
15:     RecursiveFindCliques(currentClique ∪ {node}, prunedNodes);
16:     H3: if (candidateNodes ⊆ NeighborsOf(node)) break;
17:     H4: if (this is first iteration) startNodes = startNodes - NeighborsOf(node);
18:   endwhile
19: endif

The algorithm recursively finds all cliques of the graph starting
from an initially empty current clique by adding one node at a time
to it. The nodes are drawn from a pool of candidate nodes which
initially contains all nodes of the graph. The terminating condition
of the recursion (line 2) checks to see if the candidate set is
empty. If so, the current set of clique nodes is recorded if it is
maximal (line 4), i.e., there is no other node in the graph that can
be added to the set while still remaining complete.

If the candidate set is not empty, then the algorithm proceeds to
grow the current clique with the various candidates as potential
starting points. An exponential search is performed at this point.
Various heuristics have been published for growing the maximal
cliques quickly and for avoiding repeated examination of sub-maximal
and previously examined cliques. (See Ellis Horowitz and Sartaj
Sahni, "Fundamentals of Computer Algorithms," Computer Science Press,
Rockville, Md., 1984.) The first heuristic (H1) checks to see whether
the union of the current clique and the candidate set is a subset of
some previously generated clique. If so, the current procedure call
cannot produce any new cliques and is pruned. Otherwise, the
algorithm continues to grow the current clique with the candidates
one by one.

Each candidate node is processed for inclusion into the current
clique as follows. If the selected candidate forms a complete graph
with the current clique (line 13), the algorithm adds it to the
current clique and calls the procedure recursively with the remaining
candidates (line 15). The second heuristic (H2) is to restrict the
set of remaining candidates in the recursive call to just the
neighbors of the current node, since any other node will always fail
the completeness test within the recursive call. After the recursive
call returns, if the remaining candidate nodes are found to be all
neighbors of the current node, then the algorithm can also prune the
remaining iterations within the current call (H3), since any clique
involving any of those neighbors must include the current node and
all such cliques were already considered in the recursive call.
Finally, if non-neighboring candidates are present, we can still drop
the neighbors of the current node as starting points for the first
iteration only (H4).

While we have illustrated a specific example of finding cliques in a
graph, there are other algorithms for accomplishing this task. In
addition, there are alternative approaches for finding sets of
mutually exclusive operations that do not involve cliques. It is also
possible to identify sets of concurrent operation group occurrences,
and then assign FUs so that the operation group occurrences in each
set are assigned to different FUs.

After finding maximal sets of mutually exclusive operation groups,
the datapath synthesizer selects functional units from a standard or
user-specified macrocell library so that all of the opcodes occurring
in each set are covered, i.e., able to be executed on the selected
functional units. As shown in FIG. 3, the current implementation
selects functional units to cover the exclusion cliques (see step
242). Next, the datapath synthesizer instantiates the selected
functional units as shown (step 246). In building the functional
units in this manner, the objective is to optimize the selection of
func-
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tional unit instances so that all of the required opcodes are 
still supported while maintaining the exclusion requirements 
defined by the cliques. In some cases, it may not be possible 
to map individual cliques to a single functional unit, thereby 
necessitating the use of multiple functional units to support 
the opcode requirements of the clique. Pseudocode for 
covering the cliques and building the functional units is 
listed below: 



BuildFUs(VLIWArch, listOfCliques)
1:  foreach (OPG ∈ VLIWArch)
2:    build valid ListOfFUs(Opset(OPG)) from Database;
3:    // match opcodes, latency
4:  foreach (OPG ∈ VLIWArch)
5:    foreach (usedFU ∈ ListOfFUs(Opset(OPG)))
6:      ListOfOpsets(usedFU) += Opset(OPG);
7:  while (listOfCliques is not empty)
8:    find (bestFU ∈ usedFUs) such that
9:      for some (clique ∈ listOfCliques)
10:       maxCoveredOPGs = {OPG | OPG ∈ clique,
11:                         Opset(OPG) ∈ ListOfOpsets(bestFU)}
12:     H1: size(maxCoveredOPGs) is maximum
13:     H2: area(bestFU) is minimum
14:   instantiate(bestFU); record(maxCoveredOPGs);
15:   foreach (clique ∈ listOfCliques)
16:     clique -= maxCoveredOPGs;



The first task is to build a valid list of functional units
from the macrocell database that will support the opcode and
latency requirements of each of the operation groups of the
VLIW architecture specification (lines 1-2). Conversely,
for each functional unit, the code identifies the list of
operations that it can possibly cover (lines 4-6). For example, if
the database contains an ALU0 functional unit that can perform
ADD, SUBTRACT, and MOVE opcodes, and an ALU1
functional unit that can perform ADD and MOVE opcodes,
then

ListOfOpsets(ALU0)={ADD, SUBTRACT, MOVE};
ListOfOpsets(ALU1)={ADD, MOVE};
ListOfFUs(ADD)={ALU0, ALU1};
ListOfFUs(SUBTRACT)={ALU0};
ListOfFUs(MOVE)={ALU0, ALU1}.

At each iteration of the while loop starting at line 7, a FU 
is selected that best covers the operation groups of a remain- 
ing clique. The criteria for selection in this implementation 
use two heuristics. First, heuristic H1 favors FUs that cover
the maximum number of remaining operation groups out of
any remaining clique. The second heuristic H2 selects the 
FU that is of minimum area. Other heuristics may be used 
to optimize timing, power consumption, routability, geom- 
etry (for hard macros), etc. 

The rest of the algorithm selects a set of FUs to be 
instantiated in the datapath, one by one, by looking at the 
requirements of the operation group cliques provided. Once 
the FU has been selected, it is instantiated in the datapath 
and the operations that it covers are recorded. Finally, the 
covered operation groups are eliminated from each of the 
remaining cliques and the cycle repeats until all cliques are 
covered and eliminated.
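
The two steps described above, finding exclusion cliques and then covering them with functional units, can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the recursion keeps only heuristic H2 (H1, H3, and H4 are omitted for brevity), and the operation groups, opcode names, library entries, and area figures are all hypothetical.

```python
def maximal_cliques(excl):
    """Enumerate maximal sets of pairwise-exclusive operation groups.

    excl maps each group to the set of groups it excludes (symmetric,
    no self-loops).
    """
    nodes = set(excl)
    found = []

    def grow(clique, candidates):
        if not candidates:
            # Record only if no outside node is exclusive with every member.
            if not any(clique <= excl[n] for n in nodes - clique):
                found.append(clique)
            return
        remaining = set(candidates)
        for node in sorted(candidates):
            remaining.discard(node)
            # H2: only neighbours of `node` can still extend the clique.
            grow(clique | {node}, remaining & excl[node])

    grow(frozenset(), nodes)
    return found


def build_fus(opsets, fu_library, cliques):
    """Greedily pick FUs so every operation group in every clique is covered.

    opsets: opgroup -> set of opcodes.
    fu_library: FU name -> (set of supported opcodes, area).
    """
    covers = {fu: {g for g, ops in opsets.items() if ops <= caps}
              for fu, (caps, _area) in fu_library.items()}
    pending = [set(c) for c in cliques]
    instances = []
    while any(pending):
        # H1: most opgroups covered within one clique; H2: smallest area.
        fu, covered = max(
            ((fu, c & covers[fu]) for fu in fu_library for c in pending),
            key=lambda pair: (len(pair[1]), -fu_library[pair[0]][1]),
        )
        if not covered:
            raise ValueError("library cannot cover the remaining opgroups")
        instances.append((fu, frozenset(covered)))
        pending = [c - covered for c in pending]
    return instances


# Hypothetical input: only IMUL and SHFT are mutually exclusive.
excl = {"IADD": set(), "MOV": set(), "LAND": set(),
        "IMUL": {"SHFT"}, "SHFT": {"IMUL"}}
opsets = {"IADD": {"ADD"}, "MOV": {"MOVE"}, "LAND": {"AND"},
          "IMUL": {"MUL"}, "SHFT": {"SHL"}}
library = {"ALU": ({"ADD", "MOVE", "AND"}, 3.0),
           "MULSHIFT": ({"MUL", "SHL"}, 5.0),
           "MUL": ({"MUL"}, 4.0)}

cliques = maximal_cliques(excl)
instances = build_fus(opsets, library, cliques)
```

In this example the exclusive pair IMUL and SHFT lands on one shared unit, while the three concurrent groups each receive their own ALU instance.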

The next step 248 identifies which FUs out of the selected 
set require a memory port by checking their properties 
stored in the macrocell database. This step is necessary in 
order to identify the number of ports required to connect to 
the memory hierarchy. The memory hierarchy refers to the 
processor's memory design. The memory hierarchy may 
include, for example, a level 1 (L1) data cache, a level 2 (L2)
data cache and global memory. 
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6.5.2 Register File and Interconnect Synthesis 

Referring again to FIG. 3, the right side of this diagram 
illustrates the process of synthesizing the register files and 
interconnect topology. Using the architecture specification
as input, this process allocates register file ports and builds
the interconnect to the functional units. As shown in steps
252 and 254, the datapath synthesizer builds a set of
read/write port connection requirements for connecting the
functional units to the register files for each type of register
file in the VLIW specification, including literal register files
(LRFs). The datapath synthesizer extracts these
requirements from the format specification of source/sink operands
of various operations mapped to the corresponding
functional units.

Many of these register file port connections may be shared 
based upon the mutual exclusion specification of the corre- 
sponding operation groups. As an example, assume that we 
want to build read/write port requirements for a machine 
specified by the following description: 

SECTION Operation_Group {
  OG_alu_0(ops(ADD SUB) format(OF_intarith2));
  OG_alu_1(ops(ADD SUB) format(OF_intarith2));
  OG_move_0(ops(MOVE) format(OF_intarith1));
  OG_move_1(ops(MOVE) format(OF_intarith1));
}

SECTION Exclusion_Group {
  EG_0(opgroups(OG_alu_0 OG_move_0));
  EG_1(opgroups(OG_alu_1 OG_move_1));
}

SECTION Register_File {
  gpr(width(32) regs(r0 r1 . . . r31) virtual(I));
  pr(width(1) regs(p0 p1 . . . p15) virtual(P));
  lit(width(16) intrange(-32768 32767) virtual(L));
}

SECTION Field_Type {
  FT_I(regfile(gpr));
  FT_P(regfile(pr));
  FT_L(regfile(lit));
  FT_IL(compatible_with(FT_I FT_L));
}

SECTION Operation_Format {
  OF_intarith1(pred(FT_P) src(FT_I) dest(FT_I));
  OF_intarith2(pred(FT_P) src(FT_IL FT_I) dest(FT_I));
}

In this example, there are four operation groups that
require two operation formats: OF_intarith1 and
OF_intarith2. The Operation_Format section provides the
register file port requests for each of these operation formats.
First, the datapath synthesizer translates operation group
port requests to FU port requests based on the mapping of
operation groups to FU instances decided earlier.

There are alternative ways to map operation group port
requests to FU port requests. One approach is to map all
opgroup port requests to corresponding FU port requests and
then have one RF port request per FU port request. In an
alternative approach, each opgroup occurrence is mapped to
its own RF port request. In this case, the datapath synthesizer
applies affinity allocation of RF port requests to FU port
requests. Affinity allocation is described further below.
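
The first approach (one RF port request per FU port) can be sketched for the example machine above. This is only an illustrative reading of the specification: the assignment of operation groups to FU0 and FU1 is assumed from the two exclusion groups, treating FT_IL as the union of the gpr and lit register files is an assumption about compatible_with, and the Python names are hypothetical.

```python
# Register files reachable through each field type (assumed reading of
# compatible_with: FT_IL may come from either gpr or lit).
FIELD_TYPES = {
    "FT_I": {"gpr"},
    "FT_P": {"pr"},
    "FT_L": {"lit"},
    "FT_IL": {"gpr", "lit"},
}

FORMATS = {
    "OF_intarith1": {"pred": ["FT_P"], "src": ["FT_I"], "dest": ["FT_I"]},
    "OF_intarith2": {"pred": ["FT_P"], "src": ["FT_IL", "FT_I"], "dest": ["FT_I"]},
}

OPGROUP_FORMAT = {
    "OG_alu_0": "OF_intarith2",
    "OG_alu_1": "OF_intarith2",
    "OG_move_0": "OF_intarith1",
    "OG_move_1": "OF_intarith1",
}

# Assumed result of functional unit synthesis: each exclusion group
# shares one functional unit.
OPGROUP_TO_FU = {
    "OG_alu_0": "FU0", "OG_move_0": "FU0",
    "OG_alu_1": "FU1", "OG_move_1": "FU1",
}


def fu_port_requests(opgroup_to_fu, opgroup_format, formats, field_types):
    """Merge opgroup operand requests into one request per FU port.

    Returns {(fu, operand kind, position): set of register files}.
    """
    requests = {}
    for opgroup, fu in opgroup_to_fu.items():
        fmt = formats[opgroup_format[opgroup]]
        for kind, fields in fmt.items():
            for pos, field in enumerate(fields):
                key = (fu, kind, pos)
                requests.setdefault(key, set()).update(field_types[field])
    return requests


requests = fu_port_requests(OPGROUP_TO_FU, OPGROUP_FORMAT, FORMATS, FIELD_TYPES)
gpr_reads = sorted(k for k, rfs in requests.items()
                   if k[1] == "src" and "gpr" in rfs)
```

Because the ALU and MOVE groups on each unit are mutually exclusive, their operands collapse onto the same FU ports, so each unit asks for two gpr read connections rather than three.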
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Next, the datapath synthesizer builds a conflict graph
where the nodes are the resource requests (e.g., register file
port requests) and the edges in the graph are conflict
relationships among the requests. In the implementation, the
datapath synthesizer builds a concurrency matrix between
each of the FU port requests, while taking into account the
exclusion relationships among the operation groups. The
rows and columns of the concurrency matrix correspond to
the port requests, and each element in the matrix corresponds
to a pair of port requests. At each element, the matrix
stores a binary value reflecting whether or not there is a
resource conflict between the pair of port requests.

The next step is to allocate the read and write ports as
shown in steps 256 and 258. To allocate these ports, the
datapath synthesizer executes a resource allocation algorithm.
In particular, the resource allocation algorithm in the
current implementation uses a technique based on graph
coloring to allocate the minimum number of read/write ports
for each register file that will satisfy all connection requests.

Pseudocode for this resource allocation algorithm is
listed below:

ResourceAlloc(nodeRequests, conflictGraph)
  // compute resource request for each node + neighbors
  foreach (node ∈ conflictGraph) do
    Mark(node) = FALSE;
    TotalRequest(node) = Request(node) + Request(NeighborsOf(node));
    AllocatedRes(node) = empty
  endforeach
  // sort nodes by increasing remaining total resource request
  // compute upper bound on resources needed by allocation
  resNeeded = 0; Stack = EMPTY;
  for (k from 0 to NumNodes(conflictGraph)) do
    find (minNode ∈ unmarked nodes) such that
      TotalRequest(minNode) is minimum;
    Mark(minNode) = TRUE;
    push(minNode, Stack);
    resNeeded = max(resNeeded, TotalRequest(minNode));
    foreach (nhbr ∈ NeighborsOf(minNode)) do
      TotalRequest(nhbr) -= Request(minNode);
    endforeach
  endfor
  // process nodes in reverse order (i.e., decreasing total request)
  while (Stack is not EMPTY) do
    node = pop(Stack);
    AllResources = {0 . . . resNeeded-1};
    // available resources are those not already allocated to any neighbor
    AvailableRes(node) = AllResources - AllocatedRes(NeighborsOf(node));
    // select requested number of ports from available ports
    // according to one of several heuristics
    AllocatedRes(node) = Choose Request(node) resources from AvailableRes(node)
      H1: Contiguous Allocation
      H2: Affinity Allocation
  end
  return resNeeded;

The allocation heuristic is a variant of Chaitin's graph
coloring register allocation heuristic. See Chaitin, G. J.,
Register Allocation & Spilling Via Graph Coloring, ACM
1982. Chaitin made the following observation. Suppose G is
a conflict graph to be colored using k colors. Let n be any
node in G having fewer than k neighbors, and let G' be the
graph formed from G by removing node n. Now suppose
there is a valid k-coloring of G'. This coloring can be
extended to form a valid k-coloring of G by simply assigning
to n one of the k colors not used by any neighbor of n; an
unused color is guaranteed to exist since n has fewer than k
neighbors. Stated another way, a node and its w neighbors
can be colored with w+1 or fewer colors.

In the current implementation, each FU port is viewed as
an independent resource requester, requesting a single
resource, namely, a register file data port. In an alternative
implementation, each FU could request multiple ports for a
given register file that correspond to the various operation
groups mapped to that FU. In that case, these multiple
requests would be defined to have affinity between them to
allow them to be preferably allocated to the same register file
port. This would reduce the interconnect needed to connect
the FU to the register file.

In the above pseudocode, the total resource request for a
node and its neighbors is computed by the first loop. The
heuristic repeatedly reduces the graph by eliminating the
node with the current lowest total resource request (node
plus remaining neighbors). At each reduction step, we keep
track of the worst-case resource limit needed to extend the
coloring. If the minimum total resources required exceeds
the current value of k, we increase k so that the reduction
process can continue. The graph reduction is performed by
the second loop. Nodes are pushed onto a stack as they are
removed from the graph. Once the graph is reduced to a
single node, we begin allocating register ports (resources) to
nodes. Nodes are processed in stack order, i.e., reverse
reduction order. At each step, a node is popped from the
stack and added to the current conflict graph so that it
conflicts with any neighbor from the original graph that is
present in the current conflict graph. The existing allocation
is extended by assigning register ports to satisfy the current
node's request, using register ports disjoint from ports
assigned to the current node's neighbors. This process is
shown in the third loop.

One heuristic used in the implementation favors 'contiguous
allocation'. This heuristic simplifies interconnect layout
by allocating register ports to contiguous positions. Another
heuristic is 'affinity allocation'. Affinity allocation
attempts to assign to the same register port those port requests
that come from the same FU port for different operation groups.

The following pseudocode illustrates affinity
allocation. Each node has a set of affinity siblings. The
implementation attempts to assign the same port to affinity
siblings as follows:

if node is tentatively allocated then
  make tentative allocation permanent, if possible
if node is (still) not allocated then
  try to use a sibling allocation
if node is (still) not allocated then {
  allocate contiguously;
  for each sibling of node {
    if sibling is allocated then
      try to use node's allocation in place of existing allocation
    else
      tentatively allocate sibling, using node's allocation
  } // for
}
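
The graph-reduction allocation described above can be sketched in Python as follows. This is an illustrative sketch rather than the patented code: the selection step simply takes the lowest-numbered free ports, a simplified stand-in for the contiguous heuristic (it does not guarantee contiguity when a node requests several ports), affinity allocation is omitted, and the port-request names in the example are hypothetical.

```python
def resource_alloc(request, conflicts):
    """Allocate ports to requests so conflicting requests never share a port.

    request: node -> number of ports requested.
    conflicts: node -> set of nodes it conflicts with (symmetric).
    Returns (ports needed, node -> set of allocated port indices).
    """
    # First loop: total demand of each node plus its neighbours.
    total = {n: request[n] + sum(request[m] for m in conflicts[n])
             for n in conflicts}

    # Second loop: repeatedly remove the node with the lowest remaining
    # total demand, tracking the worst case seen.
    unmarked = set(conflicts)
    stack = []
    res_needed = 0
    while unmarked:
        min_node = min(sorted(unmarked), key=lambda n: total[n])
        unmarked.remove(min_node)
        stack.append(min_node)
        res_needed = max(res_needed, total[min_node])
        for nb in conflicts[min_node] & unmarked:
            total[nb] -= request[min_node]

    # Third loop: allocate in reverse removal order, avoiding ports that
    # already belong to an allocated neighbour.
    allocated = {}
    while stack:
        node = stack.pop()
        taken = set().union(*(allocated.get(nb, set())
                              for nb in conflicts[node]))
        free = [r for r in range(res_needed) if r not in taken]
        allocated[node] = set(free[:request[node]])
    return res_needed, allocated


# Hypothetical read-port requests: the ALU runs concurrently with both
# other units, while the multiplier and shifter are mutually exclusive.
request = {"alu.src0": 1, "mul.src0": 1, "shift.src0": 1}
conflicts = {
    "alu.src0": {"mul.src0", "shift.src0"},
    "mul.src0": {"alu.src0"},
    "shift.src0": {"alu.src0"},
}
ports, alloc = resource_alloc(request, conflicts)
```

Here the two mutually exclusive requests end up on the same register file port, so two read ports suffice for three requests.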

After allocating the register file ports, the datapath syn- 
thesizer builds the register files by selecting appropriate 
register file macrocells from the macrocell database 244 to 
satisfy the read/write port allocation. The synthesizer selects 
from a macrocell database individual register file instances 
(general purpose register files, predicate register files, etc.) 
each with a number of ports which correspond to the 
read/write port allocation to build the register file(s) of the 
machine. It then stores the resultant register file instances as 
a set of classes in the processor description 232. 

As shown in step 262, the datapath synthesizer records the 
register file to functional unit port allocation as an internal 
data structure 266. Next, the datapath synthesizer builds the 
interconnect as shown in step 264. In building the 
interconnect, the synthesizer selects macrocell instances of 
wires, buses, muxes, tri-states, etc., so as to satisfy the
register file to functional unit port allocation. 

The VLIW datapath processing produces a set of 
classes of functional unit macrocell instances, register file 
macrocell instances, and interconnect component instances, 
e.g., wires, muxes, tri-state buffers, etc. FIG. 5 shows an 
example of the output graphically depicting the datapath 
synthesis process. In this example, the abstract input 267 
specifies operation groups LAND_00 and IADD_00. The
"pr? gpr, gpr s: gpr" entry is the operation format for the two 
operation groups. 

General purpose register (gpr) 270 has three control 
address line inputs ar0, ar1, and aw0, two data inputs dr0 and
dr1, and one data output dw0. The gpr provides input to and
receives output from a functional unit 272 through intercon- 
nects 274 and periphery circuitry, including sign-extend 
literal 276, multiplexor 278, and tri-state buffer 280. The 
control inputs 282, which are undefined at this point, control 
these components. The functional unit 272 comprises a 
functional unit cell instance, such as an ALU, selected from 
a standard or user-specified macrocell database. 

While FIG. 5 shows instances of only a single register file 
(gpr) and functional unit cell instance, the actual output of 
the datapath extraction will typically comprise a variety of 
register files and FU cell instances. 

6.5.3 Extraction of an Abstract ISA Spec from a 
Datapath Spec 

As shown in FIG. 1, the system may extract the abstract 
ISA spec from a datapath specification. This enables the 
system to perform a variety of design scenarios that are 
based on the abstract ISA specification, or a combination of 
the abstract ISA specification and the datapath specification. 
For example, the system can proceed to generate the VLIW 
processor's instruction format, extract its MDES, build its 
control path, and select custom templates after extracting the 
abstract ISA specification. 

Given a VLIW datapath specification, this step extracts
the information corresponding to an Abstract ISA
Specification. The register file macrocells from the datapath
specification directly provide the register file specification for the
Abstract ISA Specification. The set of opcodes that can be
executed by each Functional Unit (FU) macrocell defines the
Operation Groups. The I/O Formats for the opcodes are
determined by examining the FU macrocells' connectivity to
the Register File (RF) macrocells. The ILP constraints, and
in particular, the mutual exclusions between the Operation
Groups, are calculated by analyzing the sharing of RF ports
by the FU macrocells. At this point, all of the information
needed by the Abstract ISA Specification is present.
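
As a rough illustration of the last inference, mutual exclusions can be derived by checking which FU macrocells are wired to a common register file port. The sketch below assumes a simple dictionary view of the datapath; the FU and port names are hypothetical and do not come from the patent.

```python
from itertools import combinations


def exclusions_from_port_sharing(fu_ports):
    """Return pairs of FUs that share at least one register file port.

    fu_ports: FU name -> set of register file ports it is connected to.
    Operation groups on two FUs that share a port cannot issue together.
    """
    pairs = set()
    for a, b in combinations(sorted(fu_ports), 2):
        if fu_ports[a] & fu_ports[b]:
            pairs.add((a, b))
    return pairs


# Hypothetical connectivity: FU1 and FU2 are wired to the same gpr ports.
fu_ports = {
    "FU0": {"gpr.r0", "gpr.r1", "gpr.w0"},
    "FU1": {"gpr.r2", "gpr.r3", "gpr.w1"},
    "FU2": {"gpr.r2", "gpr.r3", "gpr.w1"},
}
exclusive_pairs = exclusions_from_port_sharing(fu_ports)
```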

6.6 MDES Extraction 

The MDES extractor programmatically generates a
description of a processor suitable for re-targeting a com- 
piler. This description includes the operation repertoire of 
the target machine, the input/output storage resources where 
the operands of each operation reside, sharing constraints 
among hardware resources used during execution of an 
operation such as register file ports and buses and their 
timing relationships that are expressed as a reservation 
pattern. 

FIG. 6 illustrates an MDES extractor module that
programmatically extracts a machine description for
re-targeting a compiler. In the process of extracting an
MDES, the extractor 300 may obtain information from an
abstract ISA specification 302, a structural datapath
specification of a processor 304, and a macrocell database 306.
The abstract ISA specification provides the operation
repertoire of the target machine, an ILP specification, the I/O
format of the operations in the repertoire, and a register file
specification. The ILP specification identifies the ILP
constraints among the operations in terms of concurrency and/or
exclusion sets.

The extractor 300 creates a machine description 308 of a
target processor from the structural description of the
processor's datapath provided in the datapath specification 304.
In the current implementation, the MDES is in the form of
a database language called HMDES Version 2 that organizes
information into a set of interrelated tables called sections
containing rows of records called entries, each of which
contains zero or more columns of property values called
fields. For more information on this language, see John C.
Gyllenhaal, Wen-mei W. Hwu, and B. Ramakrishna Rau.
HMDES version 2.0 specification. Technical Report
IMPACT-96-3, University of Illinois at Urbana-Champaign,
1996.

The MDES 308 provides information to re-target a
compiler 310 to a target processor. The form of HMDES enables
the use of a "table driven" compiler that has no detailed
assumptions about the structure of the processor built into its
program code. Instead, it makes queries to a machine
description database containing the MDES of the target
processor. For more information about the re-targetable
compiler and MDES, see Bantwal Ramakrishna Rau, Vinod
Kathail, and Shail Aditya, Machine-description driven
compilers for EPIC processors. Technical Report HPL-98-40,
Hewlett-Packard Laboratories, September 1998; and Shail
Aditya, Vinod Kathail, and Bantwal Ramakrishna Rau.
Elcor's Machine Description System: Version 3.0. Technical
Report HPL-98-128, Hewlett-Packard Laboratories,
October, 1998, which are hereby incorporated by reference.
The datapath specification may be specified manually.
However, in the current implementation, the extractor 300 is
part of an automated design system that programmatically
generates the datapath specification from the abstract ISA
specification. The datapath design process uses a textual
version of the abstract ISA called the ArchSpec to generate
a structural description of the datapath using hardware
components from the macrocell database 306. The
macrocells are in AIR format and point to actual HDL descriptions.
In addition, each macrocell has corresponding MDES
information, referred to as a mini-MDES. The mini-MDES
provides the operation repertoire of each macrocell, its
latency, internal resource constraints, and input/output port
usage.
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The extractor 300 synthesizes the overall MDES of the
target processor by combining the mini-MDES information 
from each of the functional units during a structural traversal 
of the datapath specification 304. 

For each functional unit macrocell, its mini-MDES from 
the database is instantiated and added to the target MDES. 
The actual input/output connectivity of each operation 
present in the mini-MDES is determined by structurally 
traversing the corresponding input/output ports of the mac- 
rocell until a register file port or a literal port is encountered. 
For each multiplexor output feeding into an input port of a 
functional unit, all its input signals are explored as 
alternatives, and vice versa for the output ports. All shared 
resources, such as register file ports, that are encountered 
during the traversal are identified and added as additional 
columns to the reservation table of the operation with an 
appropriate time of use. The time of use is determined by 
taking into account the latency of the operation that is 
provided in the operation's mini-MDES and any pipeline 
latches encountered during the structural traversal. In this 
manner the composite reservation pattern of each alternative 
of the operation is built by combining information from the 
mini-MDES and the structural connectivity of the target 
machine. Finally, additional resource parameters, such as the
number of registers in the register files, their widths, and the
presence of speculation tag bits, are also recorded in the
overall MDES by examining the appropriate components of
the target architecture. This completes the target MDES
synthesis, which can then be used to drive the compiler.
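
The structural traversal of an FU input port described above can be sketched as a short recursive walk. This is a minimal sketch under assumptions: the netlist is modeled as a dictionary from a signal name to its component kind and drivers, the kind labels ("mux", "latch", "rf_port", and so on) and all signal names are hypothetical, and each pipeline latch is assumed to add one cycle to the time of use.

```python
def trace_sources(netlist, port, latches=0):
    """Walk back from an FU input port to register file or literal ports.

    netlist: signal name -> (kind, list of driver signal names).
    Returns a list of (source port, pipeline latches crossed), one entry
    per alternative source reachable through multiplexors.
    """
    kind, drivers = netlist[port]
    if kind in ("rf_port", "literal"):
        return [(port, latches)]
    if kind == "latch":
        latches += 1
    found = []
    for driver in drivers:
        # A multiplexor contributes every one of its inputs as an alternative.
        found.extend(trace_sources(netlist, driver, latches))
    return found


# Hypothetical datapath fragment: an ALU input fed by a mux that selects
# between a latched gpr read port and a sign-extended literal field.
netlist = {
    "alu.in1": ("fu_input", ["mux0.out"]),
    "mux0.out": ("mux", ["gpr_latch.q", "sext.out"]),
    "gpr_latch.q": ("latch", ["gpr.dr1"]),
    "sext.out": ("wire", ["lit.field"]),
    "gpr.dr1": ("rf_port", []),
    "lit.field": ("literal", []),
}
alternatives = trace_sources(netlist, "alu.in1")
```

Each returned pair would correspond to one alternative of the operation, with the shared source port becoming a column in its reservation table and the latch count offsetting the time of use.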

The extractor may also augment the MDES with opera- 
tion issue conflicts stemming from the instruction format by 
traversing a structural description of the control path 312. 

6.6.1 Structure of the MDES 

In the current implementation, the MDES includes the 
following information: 
Operation Hierarchy 

As shown in FIG. 7, the operations visible to the compiler 
are organized in a hierarchy starling from the semantic 
operations (320)(present in the program) consisting of 
semantic opcodes and virtual registers, down to the archi- 
tectural operations (322)(implemented by the target 
machine) consisting of architectural opcodes and physical 
registers. The intermediate levels include generic operations 
(324), access-equivalent operations (326), opcode-qualified 
operations (328), register-qualified operations (330), and 
fully-qualified operations (332) that are organized in a
partial lattice called the Operation Binding Lattice. This 
hierarchy abstracts the structural aspects of the machine and 
allows the compiler to successively refine the binding of 
operations within the program from the semantic level to the 
architectural level making choices at each level that are legal 
with respect to the target machine. The terms used in the 
operation hierarchy are defined as follows: 
Architectural Operations

Architectural operations are commands performed by the 
target processor. 
Semantic Operations 

Semantic operations are operations present in the source 
program each with a predefined meaning (e.g., ADD performs
the signed binary add operation).
Architectural Registers 

Architectural registers are either literals or registers in the 
target processor. 
Virtual Registers 

A virtual register is a machine independent representation 
of a variable present in the source program. 
Compiler Registers

A compiler register is either a single architectural register
or a set of architectural registers, with a fixed spatial
relationship, that are viewed as a single entity by the
compiler.

Generic Register Sets 

A generic register set is a maximal set of compiler- 
registers that have the same storage attributes. 
Access Equivalent Register Sets 

In general, the phrase "access equivalent" refers to a set
of processor resources that are equivalent with respect to
their structural accessibility. Informally, registers are "access
equivalent" if they have similar resource usage patterns,
latency, and instruction format constraints. To describe how
this concept is specifically used in our MDES, it is necessary
to explain a few other terms. An "alternative" in this context
refers to a triple consisting of a "compiler opcode" (see
below), a "latency descriptor" (see below) and a "reservation
table" (see below) that are jointly valid for a target processor.
A "register set tuple" (RS tuple) is a tuple of register sets,
such that each register set is a subset of a single generic
register set (i.e., all the registers have the same storage
attributes). An "access-equivalent RS tuple" corresponding
to a given alternative is a maximal RS tuple, where each
register set corresponds to one of the operands of the
compiler opcode and every register tuple in the Cartesian
product of the register sets is jointly valid with that
alternative, taking into account both the connectivity
constraints of the processor as well as the instruction format
constraints. Each register set in the access-equivalent RS
tuple is an "access equivalent register set."

For every choice of the register tuple in the access
equivalent RS tuple, along with the compiler opcode of the
alternative, the resulting operation has the same latency
descriptor and the same resource reservation table, since all
of the register tuples are accessible with the same
alternative. Consequently, each access equivalent register set
contains registers that are interchangeable with respect to that
opcode after scheduling has taken place; any register can be
used in place of any other without any impact on the
correctness of a scheduled piece of code. Also, since all
register tuples implied by an access-equivalent tuple are
architecturally valid, the compiler register for each operand
can be independently selected by a register allocator in the
compiler.
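The joint-validity condition in the definition of an access-equivalent RS tuple can be illustrated with a small Python sketch. It is only an illustration under stated assumptions: the set `valid` stands in for whatever the connectivity and instruction format constraints allow for one alternative, the register names are hypothetical, and the maximality requirement is not checked.

```python
from itertools import product

def is_jointly_valid(rs_tuple, valid_tuples):
    """Return True if every register tuple in the Cartesian product of the
    register sets in rs_tuple is valid for the alternative."""
    return all(regs in valid_tuples for regs in product(*rs_tuple))

# Hypothetical alternative: the first source operand can read r0 or r1,
# the second source operand can read r2 or r3.
valid = {("r0", "r2"), ("r0", "r3"), ("r1", "r2"), ("r1", "r3")}

ok_tuple = ({"r0", "r1"}, {"r2", "r3"})
bad_tuple = ({"r0", "r1"}, {"r2", "r4"})  # (r0, r4) is not a valid tuple

print(is_jointly_valid(ok_tuple, valid))
print(is_jointly_valid(bad_tuple, valid))
```

Because every combination in `ok_tuple` is valid, a register allocator could pick the register for each operand independently, which is the property the text relies on.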

Compiler Opcodes 

A compiler opcode is an abstraction over architectural
opcodes, and is implemented by one or more architectural
opcodes. This abstraction provides a more convenient way
of representing an operation in the compiler. For example, a
register-to-register copy may be implemented in the
machine by either adding zero or multiplying by one. It is
more convenient to represent this copy operation in the
compiler as a single compiler opcode, rather than the
specific architectural opcode or a set of opcodes that implement
it in the target processor.

Generic Opcode Set

A generic opcode set is the maximal set of compiler 
opcodes that implement the same function, e.g., integer add. 
Access Equivalent Opcode Set 

An access equivalent opcode set is the maximal set of
compiler opcodes that are part of the same generic opcode
set (i.e., implement the same function) and for each of which
there is an alternative that yields the same access equivalent
RS-tuple.

Operation Descriptors 
Operations at each level of the hierarchy are characterized
by several properties that are also recorded within the 
MDES. These include the following. 
Operation Formats

Along with each operation, the MDES records the sets of
registers and literals that can source or sink its various input
or output operands respectively. A tuple of such sets, one for
each operand, is called the operation format. The size of
these sets becomes larger as we climb the operation
hierarchy, ranging from the exact set of registers accessible
from each macrocell port implementing an operation at the
architectural level to a set of virtual registers containing all
architectural registers at the semantic level.

Latency Descriptors

Each input and output operand of an operation specifies a
set of latencies associated with its sample and production
times respectively relative to the issue time of the operation.
In addition, a few other latencies may be recorded based on
the semantics of the operation (e.g., branch latency, or
memory latency). These latencies are used during operation
scheduling to avoid various kinds of timing hazards.

Resources and Reservation Tables

The various macrocells present in the datapath, the
register file ports, and the interconnect between them are
hardware resources that various operations share during
execution. Other shared resources may include operation
issue slots within the instruction register, pipeline stages or
output ports within the macrocells. Each operation within
the MDES carries a table of resources, called a reservation
table, that records the resources it needs at the appropriate
cycle times. This table is used during operation scheduling
to avoid structural hazards due to sharing of resources.

Opcode Descriptors

The structural and semantic properties of opcodes at each
level of the hierarchy are also kept within the MDES. These
properties include the number of input and output operands,
whether or not the opcode can be speculated and/or
predicated, whether or not it is associative, commutative etc.

Register Descriptors

Similarly, several properties of registers and register files
are recorded at each level of the operation hierarchy
including the bit-width, whether or not speculative execution is
supported, whether the register (or register file) is static,
rotating, has literals etc.

6.6.2 Phases of the Re-Targetable Compiler

Before describing MDES extraction in more detail, it is
helpful to begin by explaining the compiler's view of the
target processor. The compiler needs to know, for each
opcode, which registers can be accessed as each of its source
and destination operands. Additionally for an EPIC
processor, the compiler needs to know the relevant operand
latencies and the resource usage of these operations. With
the compiler's needs in mind, the MDES serves the
following two needs: 1) to assist in the process of binding the
operations and variables of the source program to machine
operations by presenting an abstract view of the underlying
machine connectivity, and 2) to provide the information
associated with each operation needed by the various phases
of the compiler.

The re-targetable compiler maps the source program's
operations to the processor's architectural operations.

The re-targetable compiler used with the current
implementation of the MDES extractor performs this mapping in
the following phases:

1. code selection;

2. pre-pass operation binding;

3. scheduling;

4. register allocation and spill code insertion;

5. post-pass scheduling; and

6. code emission.

Each phase successively refines and narrows down the
options available for either opcodes, registers, or both,
finally yielding architectural operations that can be executed
by the processor. These options are represented in a
hierarchical data structure called the operation binding lattice
shown in FIG. 7. Note that semantic and architectural
operations (320, 322) are shown in the figure, but they are
not part of the lattice. They are used to show "implementation"
relationships; semantic operations (320) are implemented
by generic operation sets (324) and architectural
operations implement fully-qualified operations (332).

The following sections describe how the re-targetable
compiler maps semantic operations to
architectural operations.

Code Selection

The code selection phase maps semantic operations (320)
to generic operation sets (324), i.e., it maps semantic
opcodes and virtual registers to generic opcode sets and
generic register sets, respectively. Note that the mapping
from semantic opcodes to generic opcodes is not, in general,
one-to-one.

Pre-pass Operation Binding

At this point, the generic operation sets (324) may contain
multiple access-equivalent operation sets (326), each
consisting of an access-equivalent opcode set along with its
access-equivalent RS-tuple. Such operations need to be
further bound down to a single access-equivalent operation
set. This is done by the pre-pass operation binding phase.
The constraint that must be satisfied is that each operation in
the computation graph has to be annotated with an access
equivalent operation set in such a way that, for every
variable, the intersection of the access-equivalent register
sets imposed upon it by all of the operations that access it,
called its access-equivalent register option set, must be
non-empty.

Scheduling

The scheduling phase is one of the main phases of an
EPIC code generator. For each operation, the scheduler
decides the time at which the operation is to be initiated. It
also determines which compiler-opcode is to be used as well
as the reservation table and latency descriptor that are used
by the operation, i.e., it picks a specific alternative. In the
case of statically scheduled EPIC machines, the scheduling
phase refines access-equivalent operation sets (326) to
opcode-qualified operation sets (328), i.e., operations in
which the possible alternatives have been narrowed down to
a particular one, as a consequence of which the opcode
options have been narrowed down to a single
compiler-opcode, but the register options are unchanged.

Register Allocation

The register allocation phase assigns a specific
compiler-register to each of the virtual registers in the computation
graph by selecting one of the compiler registers from the
corresponding access-equivalent register set. This yields
fully-qualified operations (332), i.e., a specific alternative
and a specific compiler-register tuple.

The register allocation phase may introduce additional
code to spill registers to memory. The spill code is
fully-bound as far as the registers are concerned, but it has not
been scheduled. Thus, after this phase, the program contains
two types of operations. First, it contains operations that
have been narrowed down to fully-qualified operations
(332). Second, it contains spill operations whose operands
are fully bound to compiler-register tuples, but whose
opcodes are still at the level of access-equivalent opcode
sets. We call such operations register-qualified operation sets
(330).
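The constraint stated for the pre-pass operation binding phase, namely that every variable must keep a non-empty access-equivalent register option set, can be sketched in Python as follows. This is a minimal illustration, not the compiler's actual data structure: the `accesses` list and the variable and register names are assumptions made for the example.

```python
def register_option_sets(accesses):
    """Intersect, per variable, the access-equivalent register sets imposed
    by every operation that reads or writes that variable.

    accesses: iterable of (variable, register_set) pairs, one pair per
    operand position that touches the variable (hypothetical encoding).
    """
    options = {}
    for var, regs in accesses:
        if var in options:
            options[var] = options[var] & frozenset(regs)
        else:
            options[var] = frozenset(regs)
    return options

def binding_is_legal(accesses):
    """A binding is legal only if no variable's option set is empty."""
    return all(opts for opts in register_option_sets(accesses).values())

# v1 is produced by one operation and consumed by another.
good = [("v1", {"r0", "r1", "r2"}), ("v1", {"r1", "r2", "r3"}), ("v2", {"r4"})]
bad = [("v1", {"r0"}), ("v1", {"r1"})]

print(register_option_sets(good)["v1"])
print(binding_is_legal(good), binding_is_legal(bad))
```

In the `bad` case no register is reachable by both operations, so a different access-equivalent operation set would have to be chosen for one of them (or a copy inserted) before register allocation could succeed.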
Post-pass Scheduling

A second pass of scheduling, called post-pass scheduling,
is used to schedule the spill code introduced by the register
allocator. This phase has a choice for the scheduling of
fully-qualified operations (332): it can either keep the
opcode bindings selected by the earlier scheduling phase or
it can start afresh by reverting all compiler-opcodes back to
their original access-equivalent opcode sets thereby
converting them to register-qualified operations (330). The latter
strategy gives more freedom to the scheduler in
accommodating the spill code and yields better schedules. Post-pass
scheduling deals with code containing virtual registers that
are fully bound to compiler-registers. It is greatly
constrained, therefore, by a host of anti- and output
dependences. However, since the register assignments were made
subsequent to the main scheduling phase, they are already
sensitive to achieving a good schedule.
Code Emission 

The final phase is the code-emission phase. This phase
converts fully-qualified operations to architectural
operations. This is a bookkeeping step and no decisions are made
by this phase.

6.6.3 Extracting the MDES 
6.6.3.1 Mini-MDES Components 

In order to facilitate MDES extraction directly from the
datapath components, each macrocell in the macrocell
database carries a mini-MDES which records the MDES-related
properties shown above for the architectural opcodes that it
implements. The mini-MDES is organized just as described
above except that it contains only one level of the operation
hierarchy, the architectural level, and that there are no
registers and register descriptors. Instead, the operation
format of an architectural operation is described in terms of
the input/output ports of the macrocell used by each of its
operands.

For each operand of a given operation, the mini-MDES
also records the internal latency through the macrocell. If the
macrocell is a hard macro, the latency may be accurately
modeled as absolute time delay (nanoseconds), or in case of
soft macros, approximately as the number of clock cycles
relative to the start of the execution of the operation.

For each operation, the mini-MDES records any shared
internal resources (e.g., output ports, internal buses) and
their time of use relative to the start of the execution in an
internal reservation table. This table helps in modeling
internal resource conflicts and timing hazards between
operations. For example, if a macrocell supports multiple
operations with different output latencies that are channeled
through the same output port, there may be an output port
conflict between such operations issued successively to this
macrocell. Recording the usage of the output port at the
appropriate time for each operation allows the compiler to
separate such operations sufficiently in time so as to avoid
the port conflict.
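The output port example above can be made concrete with a short Python sketch. It assumes a reservation table is encoded as a mapping from a resource name to the set of cycles (relative to the operation's own issue time) in which the resource is used; the resource names `issue` and `out_port` and the two operations are hypothetical.

```python
def conflicts(table_a, table_b, issue_gap):
    """Return True if operation B, issued issue_gap cycles after operation A,
    uses some shared resource in the same absolute cycle as A."""
    for resource, cycles_a in table_a.items():
        cycles_b = table_b.get(resource, set())
        if any((cycle + issue_gap) in cycles_a for cycle in cycles_b):
            return True
    return False

# Hypothetical macrocell: a 3-cycle multiply and a 1-cycle add that are
# issued through the same slot and drive the same output port.
mul = {"issue": {0}, "out_port": {3}}
add = {"issue": {0}, "out_port": {1}}

for gap in range(4):
    print(gap, conflicts(mul, add, gap))
```

With these tables, an add issued two cycles after the multiply would drive the output port in the same cycle as the multiply's result, so a scheduler consulting the tables would choose a gap of one or three cycles instead.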

Finally, the mini-MDES of a macrocell also reflects
whether the macrocell implements speculative and/or
predicated execution capability by incorporating such opcodes
within itself. The macrocell selection process may choose
macrocells based on the presence or absence of such
capabilities. Note that a macrocell supporting speculative
execution and/or predicated execution may be used in place of one
that does not, but its cost may be somewhat higher.

6.6.3.2 Extracting Global MDES From The 
Datapath 

The MDES extractor extracts a compiler-centric machine 
description from the datapath of the machine. It collects the 
information contained in the mini-MDESes of the various
functional unit macrocells and the MDES-related properties 
of the register files present in the datapath into a single 
global MDES. It then augments the extracted MDES with 
the topological constraints of the datapath such as connec- 
tivity to shared buses and register file ports. A pseudocode 
listing illustrating the implementation of the process of 
extracting the MDES from the datapath is provided below. 



Procedure ExtractMdes (Datapath dpath)

1: Mdes globalMdes = nullMdes;
2: for (component ∈ dpath) do
3:   if (component is a FU macrocell) then
4:     PortAltMap altMap = nullMap;
5:     Mdes miniMdes = component.MiniMdes();
6:     //accumulate the mini-Mdes operations into the global mdes
7:     for (operation ∈ miniMdes) do
8:       CompilerOpcode opcode = a copy of operation.opcode();
9:       globalMdes.InstallOpcode(opcode);
10:      for (each input/output operand of operation) do
11:        OperandAlts opdAlts = nullList;
12:        ReservationTable opdResv = nullTable;
13:        OperandLatency lat = a copy of operation.OpdLatency(operand);
14:        McellPort port = operation.OperandToMcellPort(operand);
15:        //accumulate mdes properties by traversing the datapath from this port
16:        if (this port has not been traversed before) then
17:          TraversePort(port, lat, opdResv, opdAlts);
18:          // save operand alternatives for this port
19:          altMap.bind(port, opdAlts);
20:        else
21:          opdAlts = altMap.value(port);
22:        endif
23:        opcode.RecordOperandAlternatives(operand, opdAlts);
24:      endfor
25:      //build operation alternatives as a cross product of operand alternatives
26:      opcode.BuildOperationAlternatives(operation);
27:    endfor
28:  elseif (component is a register file) then
29:    //accumulate register file properties into the global mdes
30:    globalMdes.InstallRegisterFile(component);
31:  endif
32: endfor
33: //build a hierarchy of operation alternatives for each semantic operation
34: BuildOperationHierarchy(globalMdes);
35: return globalMdes;



The extraction process starts by initializing the global 
MDES of the machine to an empty MDES (line 1, FIG. 8, 
340). Then, for each component of the datapath (344) that is 
a functional unit macrocell, the extractor installs its mini- 
MDES architectural opcodes as compiler opcodes within the 
global MDES to form the lowest level of the opcode 
hierarchy (line 9, FIG. 8, 346). Various semantic and struc- 
tural properties of the opcode including semantic opcode 
name, commutativity, associativity, number of input and 
output operands, bit encoding are also copied into the 
corresponding opcode descriptor. 

Likewise, for register file components of the datapath, the 
extractor installs the various architectural registers as com- 
piler registers into the global MDES to form the lowest level 
of the register hierarchy along with a register descriptor (line 
30, FIG. 8, 348) that records the structural properties of the 
register file. Most of these properties are determined either 
from the type of the hardware component used (e.g., whether 
or not speculative execution and/or rotating registers are 
supported), or from its structural instance parameters (e.g., 
the number and bit-width of static and rotating registers). A 
few remaining properties are carried forward from the 
archspec (e.g., the virtual file type). 

The MDES-related details of the operations implemented
by a functional unit macrocell are collected as follows. For
each input or output operand of a machine operation, the
extractor collects a set of "operand alternatives". This set is
obtained by first mapping the operand to its corresponding
macrocell port at which it is received or produced (method
call OperandToMcellPort at line 14), and then traversing the
datapath components connected to that port (procedure call
TraversePort at line 17). Operands mapped to the same port
share the same alternatives and hence datapath traversal
needs to be performed only once per port. The details of this
traversal and the generated operand alternatives are provided
later.

The sets of operand alternatives so obtained are then 
combined into "operation alternatives" (method call
BuildOperationAlternatives at line 26) (FIG. 8, 356). This is done
by taking each tuple in the Cartesian product of the sets of 
operand alternatives for the given operation and combining 
its operand properties to form operation properties. The 
operand field types are concatenated to form an operation 
format, individual operand latencies are collected to form 
the complete operation latency descriptor, and the operand 
reservation tables are unioned together with the internal 
reservation table of the operation into an overall reservation 
table for that operation alternative. As described below, the 
field types of the various operand alternatives partition the 
compiler registers of the machine into access-equivalent 
register sets. Therefore, the operation alternatives formed 
above correspond to an opcode-qualified compiler operation 
consisting of a compiler opcode and a set of access- 
equivalent register-set tuples. All such distinct operation 
alternatives are installed into the global MDES as
alternatives for the given compiler opcode.
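The combination step described above (a Cartesian product of operand alternatives, with reservation tables unioned together) can be sketched in Python. The encoding is an assumption made for illustration: each operand alternative is a `(field_type, latency, reservation)` triple, a reservation table is a frozenset of `(resource, cycle)` pairs, and the port names are hypothetical.

```python
from itertools import product

def build_operation_alternatives(operand_alts, internal_resv):
    """Combine per-operand alternatives into operation alternatives.

    operand_alts: one list per operand of (field_type, latency, resv) triples.
    internal_resv: the operation's internal reservation table from the
    mini-MDES, as an iterable of (resource, cycle) pairs.
    Returns a list of (format, latency_descriptor, reservation_table).
    """
    alternatives = []
    for combo in product(*operand_alts):
        op_format = tuple(ftype for ftype, _, _ in combo)
        latencies = tuple(lat for _, lat, _ in combo)
        resv = frozenset(internal_resv).union(*(r for _, _, r in combo))
        alt = (op_format, latencies, resv)
        if alt not in alternatives:  # keep only distinct alternatives
            alternatives.append(alt)
    return alternatives

# Hypothetical add: source 1 comes from a register port or a literal field.
src1 = [("GPR", 0, frozenset({("dr0", 0)})), ("LIT", 0, frozenset({("lit", 0)}))]
src2 = [("GPR", 0, frozenset({("dr1", 0)}))]
dest = [("GPR", 1, frozenset({("dw0", 1)}))]

alts = build_operation_alternatives([src1, src2, dest], {("alu_out", 1)})
for op_format, latencies, resv in alts:
    print(op_format, latencies, sorted(resv))
```

The two resulting alternatives differ only in the first operand's field type and in which shared resource that operand occupies, which is what lets a scheduler pick between them.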

Procedure TraversePort(McellPort thisport, OperandLatency lat,
ReservationTable resv, OperandAlts opdAlts)

1: //Assume one-to-one connections among ports
2: if (thisport is INPUT port) then
3:   case (predecessor component connected to thisport) of
4:     multiplexor: //accumulate all field type choices
5:       for (each inputport of the multiplexor) do
6:         TraversePort(inputport, lat, resv, opdAlts);
7:       endfor
8:
9:     de-multiplexor: //add a resource column to reservation table
10:      Resource res = Resource(inputport of the de-multiplexor);
11:      ReservationTable resv' = resv.AddColumn(res, lat);
12:      TraversePort(inputport, lat, resv', opdAlts);
13:
14:    pipeline latch: // add one to latency
15:      Identify inputport of the latch;
16:      ReservationTable resv' = resv.AddRow(lat);
17:      OperandLatency lat' = lat.AddLatency(1);
18:      TraversePort(inputport, lat', resv', opdAlts);
19:
20:    register/literal file: // base case
21:      FieldType ftype = FieldType(file.Registers());
22:      Resource res = Resource(outputport of the register file);
23:      ReservationTable resv' = resv.AddColumn(res, lat);
24:      opdAlts.addAlt(ftype, lat, resv');
25:  endcase
26: else //thisport is OUTPUT port (symmetric case)
27:
28: endif



6.6.4 Datapath traversal 

An important aspect of the above MDES extraction 
scheme is the datapath traversal routine TraversePort shown 
in pseudocode form above which extracts the operand 
alternatives associated with a given functional unit macro- 
cell port. We only show the input port traversal since it is 
symmetric for output ports. For simplicity, we also assume
that only one-to-one connections exist between the input and
output ports of various datapath components, i.e., multiple
sources to an input port are connected via a multiplexor, and
multiple sinks from an output port are connected via a
de-multiplexor. It is straightforward to extend this to
many-to-many connections by treating such connections as
multiple sources multiplexed onto a bus that are de-multiplexed
to the various sinks.

Each operand alternative is a triple consisting of the
following information that characterizes the macrocell port
and the hardware structures surrounding it:

1. The field type of the operand, which describes a set of
compiler registers that are the potential sources of the
operand and that are equally accessible from the input port.

2. The operand latency descriptor, which contains the
earliest and latest sampling latencies of the operand with
respect to the issue time of the operation. This may be
different for different sources reaching this port or even for
the same sources reachable via different paths.

3. The operand reservation table, which identifies any
shared resources used for accessing this operand (e.g., buses
and register file ports) and their time of use relative to the
issue time of the operation.

The strategy for collecting the operand alternatives for a
given macrocell port is as follows. The operand latency of
the various alternatives is initialized using the macrocell
mini-MDES and their reservation table is set to empty.
Starting from the macrocell port, the extractor then traverses the
various datapath components connected to it in a depth-first
traversal until an operand source such as a register file or
literal instruction field is reached. As hardware components
such as multiplexors, de-multiplexors, pipeline latches and
register files are encountered during the traversal, their
effect is accumulated into the operand latency and the
reservation table as described below.

A multiplexor (line 4) at the input port serves to bring
various sources of this operand to this port and therefore
represents alternate field types and latency paths leading to
different operation alternatives. The MDES extractor
performs a recursive traversal for each of the inputs of the
multiplexor.

The effect of a de-multiplexor (line 9) at the input is to
distribute data from a shared point (such as a shared input
bus) to various macrocell ports. This is modeled by
introducing a new resource column in the reservation table
corresponding to this shared data source. A check is placed
at the current latency row to show that this new resource is
used at that latency. The input of the de-multiplexor is
followed recursively.

A pipeline latch (line 14) encountered during the traversal
adds to the sampling latency of the operand as well as affects
the operation reservation table by adding a new row at the
beginning. The input of the latch is recursively traversed to
identify the source of the operand.

Finally, a register file port or a literal instruction field (line
20) is the point where the recursion terminates. All the
registers (literals) accessible via the register file port (literal
field) form an access equivalent register set and become part
of the field type of the operand. The register file port (literal
field) itself is recorded as a shared resource being accessed
at the current latency by adding a resource column to the
current reservation table. The triple consisting of the field
type, the operand latency, and the reservation table is
accumulated into the list of operand alternatives for this
macrocell port.
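The input-port traversal described above can be sketched in Python on a toy datapath. Several simplifications are assumptions of this sketch rather than part of the described implementation: the datapath is a dictionary `drivers` mapping each port name to the component that feeds it, a reservation table is a frozenset of `(resource, cycle)` pairs, latency is a single integer rather than an earliest/latest pair, and a pipeline latch only increments the latency (the row bookkeeping of the pseudocode is not reproduced). All port and register names are hypothetical.

```python
def traverse_port(port, lat, resv, drivers, alts):
    """Depth-first traversal from an input port back to its operand sources.

    Appends (field_type, latency, reservation_table) triples to alts.
    """
    kind, arg = drivers[port]
    if kind == "mux":
        # Every multiplexor input is a separate operand alternative.
        for in_port in arg:
            traverse_port(in_port, lat, resv, drivers, alts)
    elif kind == "demux":
        # Shared data source: record it as a resource used at this latency.
        traverse_port(arg, lat, resv | {(arg, lat)}, drivers, alts)
    elif kind == "latch":
        # A pipeline latch adds one cycle to the operand latency.
        traverse_port(arg, lat + 1, resv, drivers, alts)
    elif kind in ("regfile", "literal"):
        # Base case: the reachable registers form the field type, and the
        # file port itself is a shared resource used at this latency.
        alts.append((frozenset(arg), lat, resv | {(port, lat)}))
    else:
        raise ValueError(f"unknown component kind: {kind}")
    return alts

# Toy datapath: input i1 selects between a register read port and a literal
# field; input i2 reads a second register port through one pipeline latch.
regs = ["r0", "r1", "r2", "r3"]
drivers = {
    "fu.i1": ("mux", ["dr0", "lit0"]),
    "fu.i2": ("latch", "dr1"),
    "dr0": ("regfile", regs),
    "dr1": ("regfile", regs),
    "lit0": ("literal", ["imm"]),
}

i1_alts = traverse_port("fu.i1", 0, frozenset(), drivers, [])
i2_alts = traverse_port("fu.i2", 0, frozenset(), drivers, [])
print(i1_alts)
print(i2_alts)
```

Input `fu.i1` yields two alternatives (register or literal source), while `fu.i2` yields one alternative whose latency and port usage are shifted by the latch.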
6.6.5 Building operation hierarchy

The final step in the MDES extraction process is to
complete the higher levels of the opcode, register and
operation hierarchy within the global MDES (procedure call
BuildOperationHierarchy at line 34 of the ExtractMdes
pseudocode). This process constructs the higher levels of the
operation binding lattice (OBL) (FIG. 8, 358).

The process of constructing operand alternatives shown
above already identifies the compiler registers, and the
access-equivalent register sets. In order to complete the
register hierarchy, all distinct access-equivalent register sets
implementing a particular data type (e.g., floating point,
integer, and boolean) are collected to form a generic register
set which implements the semantic notion of a virtual
register in the program.

Next, the corresponding levels in the opcode hierarchy are
constructed using the register hierarchy. First, all compiler
opcodes implementing the same semantic opcode (as
identified by its opcode property) are collected into a generic
opcode set which forms the top layer of the opcode
hierarchy. Any operation alternative pointed to by a compiler
opcode within this generic opcode set is a valid
implementation of the corresponding semantic operation. However,
not all such alternatives are equivalent in terms of their
operand accessibility. Therefore, the set of operation
alternatives pointed to by a generic opcode set is then further
partitioned into sets of access-equivalent alternatives that
use the same access-equivalent register-set tuples. The
compiler opcodes present in each such partition form a distinct
access-equivalent opcode set which constitutes the middle
layer of the opcode hierarchy.

Finally, the missing layers of the operation hierarchy, i.e.,
generic operation sets, access-equivalent operation sets, and
register-qualified operation sets may be built using the
corresponding layers of the opcode and the register
hierarchies. In the current implementation, these layers are not
directly represented; instead they are implicitly referenced
via the opcode hierarchy.

6.6.6 Extracting MDES from the Abstract ISA

The MDES may also be extracted from the abstract ISA
specification provided in the ArchSpec. While the ArchSpec
does not provide a structural representation of the datapath,
it does provide the opcode repertoire, the I/O format of each
opcode, and the ILP constraints among the operations. The
ILP constraints can be used to extract abstract resource
constraints needed to re-target the compiler. For instance, an
exclusion set may be thought of as representing an abstract
processor resource shared by each opcode in the exclusion
set.

Simple pipelined reservation tables may be constructed
for each opcode using such shared abstract resources as if
they represented a functional unit instance used at cycle 0.
An assumed latency is assigned to each operand of an
opcode rather than extracting actual latency information
from the mini-MDES of a macrocell.

The MDES extracted in this manner is accurate only with
respect to the opcode repertoire, the I/O format and the
operation issue constraints imposed by the ArchSpec and
may only be used as a functional approximation to the
complete MDES extracted from the datapath. In particular,
it does not model any structural or timing hazards arising
from the physical resources of the machine. This is still
[...] at a desired abstract instruction set architecture or
customized instruction templates, and the final MDES extraction
and compilation may be postponed until after the processor
has been designed.

Another possibility is to superimpose the MDES extracted
from the datapath and the ArchSpec. In this approach, the
MDES is constructed in two phases: 1) Phase one extracts the
MDES from the ArchSpec, possibly before the datapath is
constructed; 2) Phase two augments the MDES produced in
phase one with physical resource constraints obtained from a
traversal of the structural datapath representation. This has
the advantage of taking the issue-time ILP constraints into
account as well as deriving the physical resource constraints
based on actual latencies obtained from the mini-MDES of
the macrocells.

6.6.7 Extracting MDES from the Control Path

Another way to account for the issue-time ILP constraints
is to perform a traversal of the structural control path
representation. The control path representation reflects issue
constraints because it is constructed based on the
instruction format, which in turn represents the issue-time
constraints in the instruction templates. The process of
extracting the issue-time ILP constraints from the control
path is explained further below.

6.6.8 MDES Extraction Examples

As described above, the MDES extractor prepares the
reservation tables by structurally traversing the datapath to
identify both internal and external resource sharing
constraints. Another example of such a structural traversal is
illustrated with reference to a datapath 450 of FIG. 9. The
datapath 450 includes structural descriptions of functional
units 472A, 472B, a register file 470, and interconnections
between the functional units 472A, 472B and the register file
470, as well as other macrocells. The MDES extractor
obtains reservation tables for the functional unit macrocell
instances from the macrocell library. The MDES extractor
then programmatically synthesizes the latency specification.
Connections to/from a given functional unit are determined
using the structural description for the processor datapath.
The connections of the inputs and the outputs of the
functional units are traversed along the buses and wires specified
in the datapath until a register file port or literal file port is
reached. The connectivity of all the functional units is
similarly determined by structurally traversing the wire
interconnect toward a register file port or a literal file port.

The two functional units 472A, 472B may share a single
register file port directly (such as port dr1), or a single
functional unit input (such as port i1 of 472A) may obtain its
input from both a register file port (such as port dr0) and a
literal file port using a multiplexor (MUX), as selected with
a control input (ctrl) 482. Structural traversal of the
functional units proceeds through the MUXes to a register file or
literal file represented by a sign-extension unit 476. If a
MUX output is directed to an input port of a functional unit,
all the inputs to the MUX are considered as alternatives.
Conversely, for a DEMUX input from an output port of a
functional unit, all the outputs of the DEMUX are
considered as alternatives and traversed (not shown).

The datapath 450 illustrates several potential resource
conflicts and choices. The data inputs of functional units
472A, 472B are connected to output (read) ports dr0, dr1 and
the data output is connected to an input (write) port dw0 of

useful, for example, in application specific processor design the register file 470 via interconnect buses 474fl-c. The 

where a quick retargeting of the compiler is needed to arrive opcode repertoire of functional imit 472A includes opcodes 
LAND, IADD; input data for these opcodes is supplied at
functional unit input ports i0, i1. The port i0 receives data
from the port dr1 of the register file 470 via the interconnect
bus 474b. The port i1 receives input data from either the port
dr0 or a literal input from a sign-extend macrocell instance
476 as selected by a control input (ctrl) 482b to a MUX 478.
The output of the functional unit 472A is driven onto the
interconnect bus 474c by a tristate buffer 480a in response
to a control input 482a.

The opcode repertoire of the functional unit 472B
includes the opcode SQRT (square root), which receives an
input at an input port i0 from the port dr1 of the register file
470. The output of the SQRT opcode is delivered to the input
port dw0 of the register file 470 through a tristate buffer
480b that is controlled by the control input 482c. The
functional units 472A, 472B both receive data from the port
dr1 of the register file 470 and write data to the port dw0 of
the register file 470. Therefore, the functional units 472A,
472B share the ports dr1, dw0. The tristate buffers 480a,
480b are provided to prevent the functional units 472A,
472B from supplying their outputs to the bus 474c
simultaneously.

To begin extraction of the MDES using the datapath
shown in FIG. 9, the MDES extractor structurally traverses
the interconnections of the functional units 472A, 472B. The
operation group mapped to ALU instance 472A contains two
operations, LAND (logical and) and IADD (integer add).
The operation format for these operations stored within the
mini-MDES shows the macrocell ports used for their various
operands. The mini-MDES also records the sampling and
production times of the various input and output operands
that are intrinsic to the macrocell. Let us suppose that these
times are 0 for each data input s0 and s1, 1 for the predicate
input sp, and 2 for the data output d0 (assuming that the
macrocell is pipelined). Finally, the mini-MDES records that
these operations execute on the same macrocell and share its
computation resources. This is represented by an internal
reservation table with a shared "ALU" resource for the two opcodes
used at cycle 0, assuming that the macrocell is pipelined.

The datapath traversal starts from the actual input and
output ports of the macrocell instance 472A. Following
input port i0, we find that it is directly connected to the gpr
register file port dr1, introducing a shared resource column
for that register port to be used at cycle 0, which is the
sampling latency of this input operand. The field type
accessible via this port is denoted by "gpr" which stands for
all the registers contained in the register file gpr 470. This
operand alternative is recorded temporarily.

The input port i1 of the macrocell instance is connected
via a multiplexor 478 to the gpr register file port dr0 as well
as a sign-extender 476 for the short literal instruction field.
This gives rise to two distinct operand alternatives, one with
field type "gpr" at latency 0 using the gpr file port dr0, and
the other with field type "s" at latency 0 using the literal
instruction field connected to the sign-extender. Similarly,
the predicate input gives rise to the operand alternative with
field type "pr" at latency 1 using the pr file port (not shown),
and the destination port o0 gives rise to the operand
alternative with field type "gpr" at latency 2 using the gpr file
port dw0. The various operand alternatives are combined to
form two distinct operation format and reservation table
combinations for the ALU macrocell, as shown in FIGS.
10A and 10B.
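
By way of illustration only, the traversal of ALU instance 472A
may be sketched as follows. This is a minimal sketch, not the
extractor itself: the names DATAPATH, LATENCY,
operand_alternatives and operation_formats are hypothetical, as
is the resource name "pr.read" (the pr file port is not shown in
the figure). The port names and latencies are those assumed in
the text above.

```python
from itertools import product

# Hypothetical connectivity for ALU 472A: each macrocell port maps to the
# endpoints reached by following its wires (through MUX 478 for port i1).
# Each endpoint is a (field type, shared resource) pair.
DATAPATH = {
    "i0": [("gpr", "gpr.dr1")],
    "i1": [("gpr", "gpr.dr0"), ("s", "literal")],  # both MUX inputs
    "sp": [("pr", "pr.read")],                     # assumed name
    "o0": [("gpr", "gpr.dw0")],
}

# Sampling and production latencies assumed in the text.
LATENCY = {"i0": 0, "i1": 0, "sp": 1, "o0": 2}


def operand_alternatives(port):
    """Return (field type, resource, cycle) for every endpoint of a port."""
    return [(ftype, res, LATENCY[port]) for ftype, res in DATAPATH[port]]


def operation_formats(ports):
    """Combine one alternative per port into (format, reservation table)."""
    combos = []
    for choice in product(*(operand_alternatives(p) for p in ports)):
        fmt = tuple(ftype for ftype, _, _ in choice)
        # The pipelined ALU itself is a shared resource used at cycle 0.
        table = {("ALU", 0)} | {(res, cyc) for _, res, cyc in choice}
        combos.append((fmt, table))
    return combos


formats = operation_formats(["sp", "i0", "i1", "o0"])
```

Only port i1 has more than one endpoint, so the cross product
yields two combinations, one with a register on i1 and one with a
short literal.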

Note that the overall latencies of the operands are the
same as the intrinsic macrocell port usage latencies since
there are no external pipeline latches. Also, the ALU
resource is marked as being used only at cycle 0 since the
macrocell is pipelined and the usage of subsequent stages of
the ALU pipeline at subsequent cycles is implicit. The above
combinations of operation formats, latencies, and
reservation tables apply to both IADD and LAND opcodes, thereby
forming two distinct operation alternatives each. These
alternatives would be combined with other alternatives from
other macrocells to give rise to the complete operation
hierarchy for these opcodes.

In its structural traversal, the MDES extractor also
prepares a reservation table for the functional unit 472B. This
reservation table is illustrated in FIG. 11. An internal
reservation table is extracted from the macrocell library. For
purposes of illustration, the SQRT unit is assumed to be
non-pipelined and have a latency of 4 clock cycles, i.e., an
output is produced 4 clock cycles after an input is received.
The fact that the unit is non-pipelined is reflected in the
internal reservation table by keeping the "SQRT" resource
busy for 4 cycles (see the column labeled SQRT in FIG. 11).

The structural traversal of the datapath proceeds as before.
The input i0 is followed to the register file 470, and a column
893 is added to the SQRT reservation table 891. The output o0
is then traversed to the tristate buffer 480b and then to the
port dw0. Corresponding column 894 is then added to the
SQRT reservation table 891. Structural traversal of the
functional unit 472B is complete.

At this point, the reservation tables for the functional units 
472A, 472B are complete. The MDES extractor installs the 
resource conflict data included in the reservation tables on 
the respective opcodes within the MDES, completing 
MDES extraction. 
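
The way a scheduler might consult such reservation tables can be
sketched as follows. This is an illustrative sketch under stated
assumptions, not the format of FIGS. 10A, 10B or 11: each table
is modeled as a set of (resource, cycle) pairs relative to the
issue cycle, the names ALU_TABLE, SQRT_TABLE and conflicts are
hypothetical, and the SQRT result is assumed to use write port
dw0 at cycle 4.

```python
# Reservation tables as sets of (resource, cycle) pairs relative to issue.
# The ALU is pipelined, so its "ALU" resource is used at cycle 0 only.
ALU_TABLE = {("ALU", 0), ("gpr.dr1", 0), ("gpr.dr0", 0), ("gpr.dw0", 2)}

# The SQRT unit is non-pipelined: its resource stays busy for 4 cycles.
# Writing the result at cycle 4 is an assumption made for this sketch.
SQRT_TABLE = {("SQRT", c) for c in range(4)} | {("gpr.dr1", 0), ("gpr.dw0", 4)}


def conflicts(first, second, offset):
    """True if `second`, issued `offset` cycles after `first`, collides."""
    shifted = {(res, cyc + offset) for res, cyc in second}
    return bool(first & shifted)
```

Under these assumptions an ALU operation and a SQRT operation
cannot issue in the same cycle because both read port dr1, a
second SQRT cannot issue until the first has released the unit,
and an ALU operation issued 2 cycles after a SQRT collides with
it on write port dw0.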

Operation issue conflicts stemming from the instruction
format may also be added to the above reservation tables in
the following way. The MDES extractor repeats the above
process after the instruction format for the target machine
has been designed and the corresponding control path and
instruction decode logic has been inserted (described in the
Control path Application referenced above). Now, the
datapath traversal is carried through the register files back up to
the instruction register treating the register files like pipeline
latches. The latency of the register files may cause one or
more rows to be added at the beginning of the reservation
table automatically corresponding to instruction decode and
operand fetch cycles. The traversal paths leading towards the
same bit positions in the instruction register would end up
recording an operation issue conflict.

Alternatively, one may directly represent the operation
group exclusions prescribed in the ArchSpec as shared
abstract resources that are used at cycle 0 and, therefore,
model operation issue conflict for the mutually exclusive
operation groups. The current implementation uses this
approach since it is simpler than traversing the control path
representation, and it de-couples the extraction of the MDES
and its use in scheduling application programs from
instruction format design and control path design processes.
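
The abstract-resource approach may be sketched as follows. The
sketch assumes that exclusions are supplied as sets of operation
group names; the function names, the "excl" resource naming and
the group names in the usage example are hypothetical and are not
taken from the ArchSpec syntax.

```python
def abstract_tables(exclusion_sets):
    """Give each exclusion set one abstract resource, used at cycle 0."""
    tables = {}
    for index, group_set in enumerate(exclusion_sets):
        for group in group_set:
            tables.setdefault(group, set()).add((f"excl{index}", 0))
    return tables


def can_issue_together(tables, a, b):
    """Two groups may issue in one cycle if they share no resource."""
    return not (tables.get(a, set()) & tables.get(b, set()))


# Hypothetical operation groups: the first two are mutually exclusive.
tables = abstract_tables([{"ALU_OPS", "SQRT_OPS"}])
```

Here ALU_OPS and SQRT_OPS both use the abstract resource at
cycle 0 and so cannot be issued together, while a group that
appears in no exclusion set, such as a hypothetical MEM_OPS, is
unconstrained.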

6.7 Instruction Format Design 

6.7.1 Introduction 


FIG. 12 is a flow diagram illustrating the instruction
format design flow in an automated processor design
system. While this particular system is designed for the
synthesis of a VLIW processor and its associated instructions,
it also illustrates how a similar design system might be
implemented for a single-issue processor. At a high level, the
system takes a high-level processor architecture
specification 510 as input, and automatically produces a complete
hardware description of the target processor design,
including a machine instruction set. The system is implemented in
a series of program modules. FIG. 12 provides an overview
of these modules, and the following description details an
implementation of them.

The high-level input specification 510 provides a desired
set of machine operations together with an abstract
specification of the concurrency and resource sharing constraints
between them. A concurrency constraint identifies which
operations are allowed to be executed at the same time,
while a resource sharing constraint identifies which
operations cannot be executed at the same time. To generalize
these concepts, we refer to them as instruction-level
parallelism (ILP) constraints. The ILP constraints may be
specified (1) directly as sets of concurrent operations, (2) as a set
of pair-wise exclusions between operations, or (3) as some
combination of exclusions and concurrency sets. The ILP
constraints specify the amount of instruction level
parallelism within the processor directly in terms of which
operations are allowed to execute in parallel and which ones may
share a processor resource. The input specification may be
entered by a user or generated by another program module.

The instruction format design is based in part on the
design of the target processor's datapath. Before the
instruction format design process begins, a datapath design process
512 generates the datapath design 514 from the input
specification 510. The current implementation includes
software components that automate the datapath synthesis
process. The instruction format design then creates the
instruction format based on the high level input 510 and the
datapath specification.

Based on the high level input 510 and datapath
specification, the instruction format (iformat, for short)
design process builds a data structure 516 representing the
instruction format. The instruction format includes a
specification of the different types of instructions supported in the
processor, called instruction templates. In the
implementation detailed below, the templates define variable-length
instructions, but they can also represent fixed-length
instructions.

Every instruction template is made up of concatenated
instruction fields, which encode one or more operations,
each including an opcode, source operand(s) and destination
operand(s). In some processor designs, the fields may
include additional bit specifiers that control the data path,
such as multiplexor selector bits, and an instruction
identifier (e.g., a template ID field that identifies the instruction).
The iformat system associates the instruction fields with the
underlying processor control ports and calculates their bit
width requirements.

In addition to enumerating these fields, the instruction
format assigns bit positions and encodings to each of them.
The bit positions are specific positions that each field
occupies in an instruction. The encodings are the binary
values associated with the instruction fields. For example, an
opcode field is associated with binary values that select a
particular opcode.

In the process of designing the instruction format, the
iformat system selects a set of templates based on the
concurrency relationships from the input specification. Each
template consists of one or more operations based on which
operations are allowed to be issued in parallel (concurrently)
in the architectural specification and which ones are
specified to be mutually exclusive. As shown in step 518 in FIG.
12, the iformat system builds concurrency cliques from the
ILP specified in the input specification. Each concurrency
clique represents a set of mutually concurrent operation
groups, such that one of the operations in each operation
group in the set may be issued in parallel with one of the
operations from each of the other operation groups. The
system then extracts instruction templates from the
concurrency cliques as shown in step 520.

For each of the operation groups, the iformat system
extracts the inputs and outputs for each operation based on
their I/O formats in the input specification and adds this
information to the iformat data structure as shown in step
522. Using the extracted I/O formats, the iformat system
enumerates the instruction fields for each of the operation
groups associated with the templates.

Before allocating bit positions to each of the instruction
fields, the iformat system sets up a Bit Allocation Problem
(BAP) specification as shown in step 524 in FIG. 12. In this
process, the iformat system uses the ILP constraints and
datapath specification to generate the data structures in the
BAP specification 526. The set-up process shown in FIG. 12
includes the following sub-steps: 1) building an instruction
format (IF) tree; 2) determining instruction field conflict
constraints; 3) partitioning instruction fields into superfields;
and 4) extracting instruction field bit requirements from the
datapath. The output of the set-up process includes: 1) the
instruction field conflict constraints 528; 2) a partitioning of
the instruction fields into superfields 530; and 3) the bit
width requirements 532.

The conflict constraints identify which fields are mutually
exclusive and can be allocated overlapping bit positions and
which fields need to be specified concurrently in an
instruction and hence cannot overlap. Fields that are needed
concurrently in an instruction are said to conflict with each
other.

The set-up process 524 assigns instruction fields to
control ports specified in the datapath. It then groups each set of
instruction fields that map to the same control port into a
superfield. These superfields enable the iformat design
system to attempt to align these instruction fields at the same bit
position in a process referred to as affinity allocation. The
need for multiplexing is minimized if fields assigned to the
same superfield are assigned to the same bit positions.

The process of partitioning instruction fields into
superfields identifies fields that should preferably share bit
positions. The iformat system enables a user or another program
module to specify fields within a superfield that must share
bits through an input data structure shown generally as
instruction field affinity information 534 in FIG. 12.

The set-up process 524 extracts bit width requirements by
traversing the fields and extracting the bit width
requirements and encodings for each field from the datapath
specification.

Once the instruction format syntax and instruction field
bit width requirements have been determined, the system
allocates bit positions to all fields as shown in step 536.
Fields are allocated using a heuristic that allows
non-conflicting fields to re-use bit positions, resulting in a shorter
overall instruction size. Fields are also aligned based on
affinity, i.e. fields associated with the same datapath
resources are aligned to the same bit positions within the
instruction register, resulting in reduced control complexity
in hardware.

As shown in FIG. 12, the resulting instruction format
includes instruction templates, instruction fields, and the bit
positions and encodings of these fields. After bit allocation,
the internal iformat data structure 516 may be output in
various forms for use by other modules of the overall
processor design system. For example, one program module
540 generates an external file format 542, which is used to
drive an assembler. Another module 544 generates a report
in the form of an instruction set manual 546.

In some applications, the iformat system may be used to
optimize an existing concrete ISA specification. In this
scenario, an existing instruction format forms part of a
concrete ISA specification 550, and the iformat system uses
the concrete ISA along with custom templates to generate an
optimized iformat programmatically. In addition to the
iformat, the concrete ISA specification contains a register
file specification 552, including register files, the number of
registers in each file, and a correspondence between each
operand instruction field type and a register file type. To
optimize the format in a concrete ISA specification, the
system begins by extracting an abstract ISA specification. As
shown in FIG. 12, the system includes a module 554 for
extracting an abstract ISA specification from the concrete
ISA specification. The system then combines the extracted
ISA specification with the additional ILP specification to
create the input specification for the iformat design flow. The
additional ILP specification provides a list of important
concurrency sets and operation group occurrences. These
concurrency sets represent statistically important subsets of
the concurrency sets that are already present in the concrete
ISA's instruction format. For example, this ILP specification
may represent custom templates 556, which are generated
by hand or programmatically. The output is an optimized
instruction format, taking into account the additional ILP
specification.

To the extent that the iformat design is based upon an
application-specific architecture specification, it is
application-specific but "schedule-neutral." The phrase
"schedule-neutral" means that statistics detailing the usage
of operations in an application program of interest have not
been used to optimize the instruction format.

To optimize an iformat design for a particular application
program, the iformat system selects custom templates from
operation issue statistics obtained from scheduling the
program. The iformat system then generates an iformat based
on a combination of the custom templates and an abstract
ISA specification.

The system uses a re-targetable compiler to generate the
operation issue statistics for a particular processor design.
As shown in FIG. 12, a module called the MDES extractor
560 generates a machine description in a format called
MDES.

This machine description retargets the compiler 564 to the
processor design based on its abstract ISA specification 510
and datapath specification 514. The compiler 564 then
schedules a given application program 566 and generates
operation issue statistics 568 regarding the usage of the
operation groups in the instruction format templates. The
system then uses the frequency of use of the operations in
each template by the application program to compute
customized templates as shown in step 569. The customization
process is automated in that it selects custom templates by
minimizing a cost function that quantifies the static or
dynamic code size and the decode cost (e.g., measured in
chip area).

The process of selecting instruction templates in the
iformat based on scheduling statistics may be conducted as
a stand-alone process, or may be conducted in conjunction
with the automated iformat design process. In the latter case,
it may be used to provide an initial input specification of the
desired ILP constraints to the automated iformat design
process. Additionally, it may be used to optimize an existing
iformat design.

The system may also perform additional optimization by
using variable-length field encodings to further reduce the
instruction size. These optimized designs can lead to
dramatic reductions in code size, as shown in the detailed
description below.

6.7.2 Implementation of the Input Specification

The principal input of the iformat design process is an
Abstract Instruction Set Architecture (ISA) specification
510. In the current implementation, the user or another
program module may provide this specification as an
ArchSpec in textual form.

An ArchSpec reader module converts the textual form of
the ArchSpec to an abstract ISA spec data structure, which
contains a machine-readable set of tabular parameters and
constraints, including register file entries, operation groups,
and exclusion/concurrency relationships.

6.7.3 Instruction Syntax

VLIW processors issue instructions having multiple
instruction fields. An instruction field is a set of bit positions
intended to be interpreted as an atomic unit within some
instruction context. Familiar examples are opcode fields,
source and destination register specifier fields, and literal
fields. Bits from each of these fields flow from the
instruction register to control ports in the data path. For example,
opcode bits flow to functional units, and source register bits
flow to register file read address ports. Another common
type of instruction field is a select field. Select fields encode
a choice between disjoint alternatives and communicate this
context to the decoder. For example, a select bit may indicate
whether an operand field is to be interpreted as a register
specifier or as a short literal value.

An operation is the smallest unit of execution; it
comprises an opcode, source operands, and destination operands.
Each operand may support one or more operand types. A set
of possible operand types is called an io-set. A list of io-sets,
one per operand, forms an operation's io-format. For
example, suppose an add operation permits its left source
operand to be either an integer register or a short literal
value, and suppose its right source and destination operands
source and sink from integer registers. The corresponding
io-sets are {gpr, s}, {gpr}, {gpr}. The io-format is simply
this list of io-sets, which are abbreviated in shorthand
notation as follows:

gpr s, gpr:gpr

Closely related operations such as add and subtract often
have the same io-format. One reason for this is that related
operations may be implemented by a single, multi-function
unit (macro-cell). As discussed above, to simplify the
instruction format design process, related operations are
grouped into operation groups.

The instruction format assigns sets of op groups (called
super groups) to slots of an instruction. The processor issues
operations within an instruction from these slots
concurrently. To fully specify an operation, the instruction format
specifies both an op-group and an opcode (specific to that
op-group). In effect, this organization factors a flat opcode
name space into a multi-tier encoding. In rare cases, this
factorization may increase the encoding length by one bit
per level. However, it should be noted that this approach
does not preclude a flat encoding space: placing each
operation in its own op-group eliminates the factorization.
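
The cost of factoring the opcode name space can be illustrated
with a short calculation. This is a sketch only: the function
names are hypothetical, and simple logarithmic (fixed-width)
field encoding is assumed.

```python
def field_bits(choices):
    """Bits needed to select one of `choices` values (0 for one choice)."""
    return (choices - 1).bit_length()


def flat_bits(group_sizes):
    """One opcode field covering every operation in every op-group."""
    return field_bits(sum(group_sizes))


def factored_bits(group_sizes):
    """An op-group select field plus an opcode field within the group."""
    return field_bits(len(group_sizes)) + field_bits(max(group_sizes))
```

For example, flat_bits([5, 5, 5]) is 4 while
factored_bits([5, 5, 5]) is 5, showing the one extra bit. With
fifteen op-groups of one operation each, both functions return 4,
which corresponds to the flat encoding space noted above.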
More importantly, hierarchical encoding often gives the
same benefits as variable-length field encoding, but is
simpler to implement.

6.7.4 The Instruction Format Tree

In a flat, horizontal instruction format, all instruction
fields are encoded in disjoint positions within a single, wide
instruction. A hierarchical instruction format allows
exclusive instruction fields (those that are not used simultaneously
in any instruction) to be encoded in overlapping bit
positions, thereby reducing the overall instruction width. In
the instruction format design system shown in FIG. 12, the
hierarchical relationship between instruction fields is
represented by an instruction format tree (if-tree). The leaves of
an if-tree are instruction fields, where each leaf points to a
control port in the data path, such as a register file address
port, or an opcode input of a FU.

FIG. 13 illustrates the structure of an if-tree used in the
current implementation. The overall structure of the tree
defines how each instruction is built. Each part of the tree
represents a node, with the lowest nodes (the
cut-off-box-shaped nodes) forming the tree's leaves. The oval-shaped
nodes are "OR" nodes, while the boxed-shaped nodes are
"AND" nodes. The OR nodes denote a selection between the
children of the node such that only one choice (one branch)
extends to the next level. Conversely, an AND node allows
all of the components of the node to form new branches.
Stated another way, each level of the tree is either a
conjunction (AND) or disjunction (OR) of the subtrees at the
lower level.

The root node 632 of the tree is the overall machine
instruction. This is an OR node representing a choice of
instruction templates. A template select field (template ID) is
used to identify the particular template. This select field is
illustrated as the leaf node labeled "steer" connected to the
instruction node 632.

Individual instructions are based on instruction templates,
which are the AND-type child nodes of the root node (See,
e.g., templates 634 and 636). The templates each encode the
sets of operations that issue concurrently. Since the number
of combinations of operations that may issue concurrently is
astronomical, it is necessary to impose some structure on the
encoding within each template. Hence, each template is
partitioned into one or more operation issue slots. Every
combination of operations assigned to these slots may be
issued concurrently.

In addition, each template has a consume to end-of-packet
bit field (CEP) that indicates whether the next instruction
directly follows the current instruction or it starts at the next
packet boundary. This capability is used to align certain
instructions (e.g. branch targets) to known address
boundaries. Each template also specifies the number of spare bits
that may be used to encode the number of no-op cycles to
follow the current instruction. These spare bits may arise due
to a need for packet alignment or quantized allocation.

The next level of the tree defines each of the concurrent
issue slots. Each slot is an OR node supporting a set of
operation groups, called a super group (i.e., nodes 638, 640,
642), that are all mutually exclusive and have the same
concurrency pattern. A select field chooses among the
various operation groups within a super group. Again, this select
field is illustrated as the leaf node labeled "steer" connected
to super group 640.

Below each super group lie operation groups as defined in
the input specification as described above. Each operation
group (e.g., operation group 643) is an OR node that has a
select field ("steer") to choose among the various operation
formats supported by operation group. FIG. 13 shows this
situation where one operation format allows a literal field on
the left port, while the other allows it on the right port.

Each operation format (e.g., IO format descriptors 644,
646) is an AND node consisting of the opcode field 654, the
predicate field (if any) 656, and a sequence of source and
destination field types (shown as IO sets 648, 650, 652). The
traditional three-address operation encoding is defined at
this level.

Each IO set is an OR node consisting of a singleton or a
set of instruction fields that identify the exact kind and
location of the operand. IO sets with multiple choices (e.g.,
650) have a select field to identify which instruction field is
intended. For example, one of the IO set nodes 650
represents a selection between instruction fields 660, 662, which
is controlled via a multiplexor select field 664. The other IO
sets each have only one kind of field, and thus, have a single
child node representing that field (nodes 658, 666). The
instruction fields point to the datapath control ports 668.

In implementing an instruction format, one principal
design choice is whether to use a single, fixed-length
instruction format, or allow variable-length instructions. The
iformat design system supports both fixed and variable length
instructions. The use of variable-length instructions
produces more-compact code but increases decode complexity.
The trade-off between code size and instruction decode
complexity is a primary design consideration. A single,
fixed-length instruction format simplifies decode logic and
the data path for dispersal of operations to functional units,
but it often results in poor code density, since the single
format must accommodate the worst-case (longest)
instruction. For example, if the longest instruction in a fixed-length
instruction format is 128 bits long, then all of the
instructions in the instruction set must be 128 bits long. In order to
maintain a constant instruction length, many instructions
will require the use of wasted bits whose sole purpose is to
fill in unused space in the instructions. These wasted bits
lead to increased code size. Conversely, variable-length
instructions can accommodate both wide and compact,
restricted instruction formats without wasting bits, which
results in a reduction in code size. By using variable-length
instructions, the instruction format can accommodate the
widest instructions where necessary, and make use of
compact, restricted instruction formats, such as instructions
that do not encode long literals.

FIG. 14 shows the format of an instruction and its
building blocks. At the heart of the instruction is an
instruction template 670. An instruction template encodes sets of
operations that issue concurrently. Each template includes
multiple concurrent slots 672, where each slot comprises a
set of exclusive operation groups 674. Since all of the
operations in an operation group are exclusive, all of the
operations in each slot are also exclusive. Each template
encodes the cross-product of the operations in each of its
slots.

The length of each template is variable, depending in part
on the length and number of the slots in the template. For
example, some templates might have two slots, while other
templates might have three or four slots. Furthermore, the
width of each slot will depend on the width of the widest
operation group within that slot, plus overhead, as shown in
the lower portion of FIG. 14. There is considerable similarity
and overlap among the opcodes within an operation group
by construction, so very little encoding space is wasted
within the operation group. But the opcode field now must
be split into an operation group selection field 676 and
opcode selection field 678 within the operation group. With
logarithmic encoding, this requires at most one additional bit
for encoding the opcode. For example, 15 opcodes may be
encoded in 4 bits, while splitting them into 3 operation
groups of 5 opcodes each requires ⌈log2(3)⌉ + ⌈log2(5)⌉ = 5
bits. In addition, every slot has a reserved no-op encoding.
In cases where an op group has alternative operation
formats, there is yet another select field to select the
operation format.

Each instruction also includes a consume to end-of-packet
bit 680, and a template specifier 682. The template specifier
identifies the template. An instruction format having t
templates will need ⌈log2(t)⌉ bits to encode the template
specifier. This template specifier is in a fixed position within every
instruction, and from its value, the instruction sequencer in
the processor's control path determines the overall
instruction length, and thus the address of the subsequent
instruction.

In the current implementation, the length of the
instruction is variable, but each length is a multiple of a
predetermined number of bits called a quantum. For instance, if
the quantum is 8 bits, the length of the instruction could be
any number equal to or above some minimum value (say 32
bits) that is divisible by 8, such as 64 bits, 72 bits, 80 bits,
etc. One or more dummy bits may be placed as appropriate
within the instruction to ensure that the length of the
instruction falls on a quantum boundary.

The iformat system builds the levels of the if-tree in an
incremental fashion. It constructs the top three levels,
consisting of the instruction, the templates, and the super groups
from the abstract ISA specification, and optionally, custom
templates. It constructs the middle layers, including the
operation groups, the operation formats, and the field types
from the abstract ISA specification. Finally, it constructs the
instruction fields from the contents of the various field types
in the abstract ISA specification and the individual control
ports in the datapath that each field is supposed to control.

6.7.5 Instruction Templates

A primary objective of the instruction format design
system is to produce a set of instruction templates that
support the encoding of all of the sets of operation groups
that can be issued concurrently. To initiate the template
design process, the instruction format design system starts
out with the architecture specification, which defines the
exclusion and concurrency constraints for a particular
design. In one implementation, the architecture specification
directly provides the exclusion relationships between
operation groups. However, the iformat design process needs to
know which opcodes can be issued concurrently, i.e., the
concurrency relationship, rather than which opcodes must be
exclusive.

In such an implementation, the concurrency relationship
is taken to be the complement of the exclusion relationship.
One way of determining the concurrency relation is to take
the complement of the exclusion relations among opcodes
implied by the architecture specification and treat each set of
concurrent opcodes as a potential candidate for becoming an
instruction template. While this provides an excellent
starting point, it unfortunately does not lead to a practical
solution, since the number of combinations of operations
that may issue concurrently quickly becomes intractable.
For example, a typical VLIW machine specification may
include 2 integer ALUs, 1 floating point ALU and 1 memory
unit, with 50 opcodes each. In such a machine the total
number of distinct 4-issue instructions is 50×50×50×50 =
6,250,000. Specializing instructions to 1, 2, and 3-issue templates
would add many more. It is therefore necessary to impose
some structure on the encoding within each template.

Our current implementation uses several mechanisms to
reduce the complexity of the problem. These mechanisms
represent iformat design decisions and affect the final
instruction format layout and size. In most cases there may
also be a tradeoff between the simplicity and orthogonality
of the field layout (and hence the decode hardware) and the
size of the instruction template. These tradeoffs will be
described as the design process is detailed below.

As a first axiom, all templates must satisfy an exclusion
constraint between two opcodes, i.e. these opcodes must
never occupy separate slots in any template. This is because
these opcodes may share hardware resources during
execution, and therefore, the scheduler should never put
these opcodes together within the same instruction. On the
other hand, a concurrency constraint between two opcodes
implies that the scheduler is free to issue these opcodes
together in a single instruction and therefore there should be
some template in which these two opcodes are allowed to
occur together. In particular, that template may contain
additional slots that can be filled with noops, if necessary.
Therefore, it is unnecessary to generate a special template
for each concurrency constraint, but rather all that is needed
is a set of templates that can effectively cover all possible
sets of concurrently scheduled opcodes.

The problem becomes greatly simplified when the
concurrency of operation groups is considered instead of
individual opcodes. As introduced above, operation groups are
defined as sets of opcode instances that are generally similar
in nature in terms of their latency and connectivity to
physical register files and are expected to be mutually
exclusive with respect to operation issue. All opcodes within
an operation group must be mutually exclusive by definition.
Furthermore, the instruction format is designed so that all
opcodes within an operation group share the same
instruction fields. Thus, the operation group is an obvious choice
for the primary building block for creating templates.

Another simplification involves classifying
mutually-exclusive operation groups into equivalence classes called
super groups based on the constraints provided in the
architecture specification. FIG. 15 illustrates an example that
shows how the operation groups (shown as letters) and
exclusion relations are used in the template selection
process. The process starts with the ILP constraints 681, which
define a set of exclusion relationships 683 between operation
groups 684. From these exclusion relationships, the iformat
design system builds a boolean exclusion matrix 686. In the
exclusion matrix 686, the rows and columns are matched up
with respective operation groups, e.g., "A" corresponds to
the operation group A, "B" corresponds to the operation
group B, etc. The 1's in the matrix indicate an exclusion
relationship, while a blank indicates that the corresponding
operation groups may be issued concurrently. (The blanks
are actually 0's in the real matrix; blanks are used here for
clarity). The system then builds a concurrency matrix 688
from the exclusion matrix 686. The concurrency matrix 688
is the complement of the exclusion matrix 686. The "?"s
along the diagonal of the concurrency matrix 688 can be
interpreted as either a 1 or 0.

The rows in the concurrency matrix determine a set of
concurrency neighbors for each operation group. A graphical
representation of the relationships defined by the
concurrency matrix 688 is shown in concurrency graph 692. Each
node represents an operation group, while each connecting
"edge" represents a concurrency relation. A clique is a set of
nodes from a graph where every pair of nodes is connected
by an edge. For instance, there are 16 cliques in the
concurrency graph 692.

After the concurrency matrix is generated, the system
compares the rows in the concurrency matrix to identify
equivalent operation groups. The super groups are formed
from the equivalent operation groups. Two operation groups
are said to be equivalent if they have the same set of
concurrency neighbors. Note that two mutually exclusive
operation groups that have the same set of concurrency
neighbors can replace each other in any template without
violating any exclusion constraint and therefore can be
treated equivalently. Similarly, two concurrent operation
groups that have the same set of concurrency neighbors
(other than themselves) can always be placed together in a
template without violating any exclusion constraints and
therefore can be treated equivalently.

An example of pseudocode for performing equivalence
checking and partitioning into super groups is illustrated
below.

Procedure FindSuperGroups(BitMatrix concur)
  // "concur" is a (numNodes x numNodes) boolean matrix
  // First, initialize supergroup hash table and id counter
  HashMap<BitVector, int> SGmap;
  int SGcount = 0;
  for (i = 0 to numNodes-1) do
    // extract each node's vector of neighbors w/ and w/o self
    BitVector AND-group = concur.row(i).set_bit(i);
    BitVector OR-group = concur.row(i).reset_bit(i);
    // Check for existing AND-style supergroup for this node
    if (SGmap(AND-group) is already bound) then
      SGkind(i) = SG-AND;
      SGid(i) = SGmap(AND-group);
    // Check for existing OR-style supergroup for this node
    else if (SGmap(OR-group) is already bound) then
      SGkind(i) = SG-OR;
      SGid(i) = SGmap(OR-group);
    // If neither neighbor relation is present, start a new
    // supergroup with the new neighbor relations
    else
      SGid(i) = SGcount;
      SGmap(AND-group) = SGmap(OR-group) = SGcount;
      SGcount = SGcount + 1;
    endif
  endfor

The equivalence check and the partitioning can be
performed quickly by employing the pigeon-hole principle. The
algorithm hashes each operation group using its set of
neighbors in the concurrency matrix as the key. The
neighbor relations (neighbor keys) for each operation group (each
row) are converted to bitvectors. The algorithm hashes in
two ways: once by treating each operation group as
concurrent with itself (AND-style) thereby finding equivalent
concurrent operation groups, and the second time by treating
each operation group as exclusive with itself (OR-style)
thereby finding equivalent exclusive operation groups. This
hashing approach results in two bitvectors for each operation
group: one with the diagonal entry changed to a 1 (AND-style),
and one with the diagonal entry changed to a 0 (OR-style).

Bitvectors (operation groups) that hash to the same bucket
necessarily have the same concurrency neighbors and
therefore become part of the same super group. For example in
FIG. 15, operation groups A, B, and C have the same
concurrency neighbors and thus form the super group {A, B,
C}. The other super groups, {P, Q}, {X, Y}, and {M, N}, are
similarly formed. The set of all distinct super groups is
defined by all the distinct neighbor keys. This partitioning
leads to a reduced-concurrency (super group) graph 694,
comprising the super groups and their concurrency relations.
Instruction templates 696 are obtained from the reduced
concurrency graph, as described below.

Each operation group identifies whether it is an AND-type
or OR-type super group. This information is used in the final
template expansion, where each operation group from an
AND-type super group is given a separate slot, while all
operation groups from an OR-type super group are put into
the same slot.

In the concurrency matrix 690 shown in FIG. 15, the
diagonal entries of the "A", "B", and "C" operation group bitvectors
have been changed to 0's so that their corresponding
bitvectors are identical. Thus, "A", "B", and "C" form an OR-type
super group {A, B, C}, and each operation group is placed
in the same slot.

FIG. 16 shows a case with an AND-type and an OR-type
super group. In order to obtain identical bitvectors, the "A",
"B", and "C" operation groups are treated as being
concurrent with themselves. As a result, they form an AND-type
super group and are placed in separate template slots. In
contrast, the "M", "N", "X", and "Y" operation groups are
treated as exclusive with themselves and form two different
sets of OR-type super groups {M, N} and {X, Y}, which each
occupy a single slot.

For a homogenous VLIW-style machine with multiple,
orthogonal functional units this process yields tremendous
savings by reducing the complexity of the problem to just a
few independent super groups. The resulting instruction
templates closely match super groups to independent issue
slots for each functional unit. For a more heterogeneous
machine with shared resources, the resulting number of
templates may be larger and the decoding is more complex
but partitioning the operation groups into super groups still
reduces the complexity of the problem significantly.

6.7.6 Concurrency Cliques and Templates

Once the super groups have been determined, each clique
in the reduced concurrency graph is a candidate for an
instruction template since it denotes a set of super groups
that may be issued in parallel by the scheduler. A clique is
a subgraph in which every node is a neighbor of every other
node. Clearly, enumerating all cliques would lead to a large
number of templates. On the other hand, unless the
concurrency among super groups is restricted in some other way,
it is necessary to choose a set of templates that cover all
possible cliques of the super group graph to ensure that the
scheduler is not restricted in any way other than that
specified in the ArchSpec.

As an example, suppose super groups A, B and C only
have pairwise concurrency constraints, i.e., {AB}, {AC},
and {BC}. These pairwise concurrencies can be covered in
one of two ways. First, the pairwise concurrency constraints
can be treated as three independent templates AB, AC, and
BC, each requiring two issue slots. A second possibility is to
treat the pairwise concurrencies as being simultaneously
concurrent, thereby requiring only one template (ABC) with
three issue slots. Strictly speaking, this allows more
parallelism than what was intended. If the compiler never
scheduled all three operations simultaneously, the second design
would end up carrying one noop in every instruction thereby
wasting one-third of the program space. On the other hand,
the first design requires additional decoding logic to select
among the three templates and more complex dispersal of
the instruction bits to the various functional units.
In the present scheme, this tradeoff is made towards
initially choosing a reduced number of possibly longer
templates. This is partly due to the fact that the ArchSpec
does not directly specify concurrency in most instances, but
rather specifies exclusion relations among operation groups
that are then complemented to obtain concurrency relations.
During the initial template design phase, choosing the
maximally concurrent templates covers all possible concurrency
relations with as few templates as possible.
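
The complementing step mentioned above, together with the super-group partitioning of section 6.7.5, can be sketched as follows. This is a minimal illustration, not the implementation described in the specification: the function name `find_super_groups`, the use of Python frozensets in place of bitvectors, and the sample operation groups are all assumptions made for the example.

```python
from itertools import combinations


def find_super_groups(groups, exclusions):
    """Partition operation groups into super groups by neighbor key.

    groups:     list of operation group names (hypothetical labels)
    exclusions: iterable of pairs of mutually exclusive groups
    Returns (members, kind): members[i] lists the groups of super group i,
    kind[i] is "AND", "OR", or None for a singleton super group.
    """
    excl = {frozenset(pair) for pair in exclusions}
    # Concurrency is taken as the complement of exclusion (self left out).
    nbrs = {
        g: frozenset(h for h in groups
                     if h != g and frozenset((g, h)) not in excl)
        for g in groups
    }
    sg_of_key = {}  # neighbor key -> super group id
    members = []    # super group id -> operation groups
    kind = []       # super group id -> "AND" / "OR" / None
    for g in groups:
        and_key = nbrs[g] | {g}  # g treated as concurrent with itself
        or_key = nbrs[g]         # g treated as exclusive with itself
        if and_key in sg_of_key:
            sg = sg_of_key[and_key]
            kind[sg] = "AND"
        elif or_key in sg_of_key:
            sg = sg_of_key[or_key]
            kind[sg] = "OR"
        else:
            sg = len(members)
            sg_of_key[and_key] = sg_of_key[or_key] = sg
            members.append([])
            kind.append(None)
        members[sg].append(g)
    return members, kind


# Hypothetical input: A, B, C exclude one another, as do X and Y.
example_exclusions = list(combinations("ABC", 2)) + [("X", "Y")]
example_members, example_kind = find_super_groups(list("ABCXY"),
                                                  example_exclusions)
```

Unlike the pseudocode, the sketch records the AND/OR kind per super group rather than per operation group, so the first member of a group receives the same label as the members that join it later.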

The maximally concurrent templates may be determined 
by finding the cliques of the super group graph. An example 
of a simple reduced super group concurrency graph is shown 
in FIG. 17. The graph comprises super groups 1-7, and their 
interconnecting edges. The maximal cliques for such a 
simple graph can be determined by hand by simply
identifying sets of nodes that are completely connected; that is,
each node in a clique must connect to the remaining nodes
in the clique. For instance, {1, 3, 7} is a clique, while {2, 4,
5, 6} is not (nodes 5 and 6 are not connected). In the
supergraph of FIG. 6, there are seven maximal cliques, and
thus seven maximally concurrent templates.

It is necessary to use computational means to calculate the 
cliques for more complex super group graphs. The instruc- 
tion format designer uses the same approach for finding 
cliques as the datapath synthesizer described above. 
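
As an illustration of computing maximal cliques mechanically, the sketch below uses the basic Bron-Kerbosch recursion. This is a standard textbook algorithm chosen for the example; the specification does not state which clique-finding method the datapath synthesizer uses. The adjacency map is a hypothetical graph, not the graph of FIG. 17.

```python
def maximal_cliques(adj):
    """Return all maximal cliques of an undirected graph.

    adj maps each node to the set of its neighbors (symmetric, no self loops).
    Basic Bron-Kerbosch recursion without pivoting.
    """
    cliques = []

    def expand(r, p, x):
        # r: current clique, p: candidates, x: already-explored nodes
        if not p and not x:
            cliques.append(frozenset(r))
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return cliques


# Hypothetical super group graph: A, B, C pairwise concurrent; D only with C.
example_graph = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C"},
}
example_templates = maximal_cliques(example_graph)
```

Each returned clique corresponds to one maximally concurrent template candidate; here the pairwise relations among A, B and C collapse into the single three-slot candidate {A, B, C}, matching the second design option discussed in the text.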

6.7.7 Set-Up of Bit Allocation Problem 

Once the templates are selected, the iformat system con- 
structs the lower levels of the IF tree. The templates form the 
upper level of the tree. For each of the operation groups in 
a template, the system extracts the inputs and outputs for 
each operation based on their I/O formats in the abstract ISA 
specification and adds this information to the IF tree. Using 
the extracted I/O formats, the system enumerates the instruc- 
tion fields for each of the operation groups associated with 
the templates. Next, it builds field conflicts, partitions 
instruction fields into superfields, and extracts bit width 
requirements. 

6.7.7.1 Instruction Fields 

As shown in FIG. 13, the instruction fields form the leaves 
of the if-tree. Each instruction field corresponds to a data- 
path control port such as register file read/write address 
ports, predicate and opcode ports of functional units, and 
selector ports of multiplexors. Each field reserves a certain 
number of instruction bits to control the corresponding 
control port. 

The iformat designer assigns each field to a control port 
by traversing the if tree to find the operation group associ- 
ated with the field, and then extracting the functional unit 
assigned to the operation group in the datapath specification. 

The following sub-sections describe various kinds of
instruction fields. FIG. 20 is annotated with letters S, A, L, 
op and C to illustrate examples of the information flowing 
from these fields in the instruction register to the control 
ports in the data path. 
Select Fields (S) 

At each level of the if-tree that is an OR node, there is a
select field that chooses among the various alternatives. The
number of alternatives is given by the number of children,
n, of the OR node in the if-tree. Assuming a simple binary
encoding, the bit requirement of the select field is then
⌈log2(n)⌉ bits.
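
The select-field width can be checked with a short calculation. The helper name `select_bits` is an assumption for this sketch; the numbers reproduce the earlier opcode example (15 opcodes in one group versus 3 operation groups of 5 opcodes each).

```python
def select_bits(n_alternatives):
    """Bits needed for a binary-encoded select field: ceil(log2(n)).

    A node with zero or one alternative needs no select bits.
    """
    return max(n_alternatives - 1, 0).bit_length()


# 15 opcodes encoded directly in one field.
flat_bits = select_bits(15)
# The same 15 opcodes split into 3 operation groups of 5 opcodes each:
# a group select field plus an opcode select field within the group.
split_bits = select_bits(3) + select_bits(5)
```

Integer arithmetic is used instead of a floating-point logarithm so that exact powers of two are not affected by rounding.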

Different select fields are used to control different aspects
of the datapath. The root of the if-tree has a template select
field that is routed directly to the instruction unit control
logic in order to determine the template width. It also
specifies where the supergroup select fields are positioned.
Therefore, this field must be allocated at a fixed position
within the instruction. Together with the template select
fields, the select fields at super group and operation group
levels determine how to interpret the remaining bits of the
template and therefore are routed to the instruction decode
logic for the datapath. The select fields at the level of field
types (IO sets) are used to control the multiplexors and
tristate drivers at the input and output ports of the individual
functional units to which that operation group is mapped.
These fields select among the various register and literal file
alternatives for each source or destination operand.
Register Address Fields (A) 

The read/write ports of various register files in the
datapath need to be provided address bits to select the register to
be read or written. The number of bits needed for these fields 
depends on the number of registers in the corresponding 
register file. 

Literal Fields (L)

Some operation formats specify an immediate literal
operand that is encoded within the instruction. The width of
these literals is specified externally in the ArchSpec. Dense
ranges of integer literals may be represented directly within
the literal field, for example, an integer range of -512 to 511
requires a 10-bit literal field in 2's complement
representation. On the other hand, a few individual program constants,
such as 3.14159, may be encoded in a ROM or a PLA table
whose address encoding is then provided in the literal field.
In either case, the exact set of literals and their encodings
must be specified in the ArchSpec.
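
The 10-bit figure for the range -512 to 511 can be verified with a small helper. The function name `literal_bits` is hypothetical; the sketch only covers dense integer ranges stored directly in two's complement, not constants looked up through a ROM or PLA table.

```python
def literal_bits(lo, hi):
    """Smallest two's-complement width that holds every integer in [lo, hi]."""
    bits = 1
    # A b-bit two's-complement field covers -2**(b-1) .. 2**(b-1) - 1.
    while not (-(1 << (bits - 1)) <= lo and hi <= (1 << (bits - 1)) - 1):
        bits += 1
    return bits


example_width = literal_bits(-512, 511)
```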
Opcode Fields (op) 

The opcode field bits are used to provide the opcodes to
the functional unit to which an operation group is assigned.
It is possible to use the internal hardware encoding of
opcodes in the functional unit directly as the encoding of the
opcode field, in which case the width of the opcode field is
the same as the width of the opcode port of the
corresponding functional unit and the bits are steered directly to it. This
mechanism may be used when all the opcodes supported by
a functional unit are present in the same operation group or
the same super group.

Under some templates, however, the functional unit
assigned to a given operation group may have many more
opcodes than those present within the operation group. In
this case, opcode field bits may be saved by encoding the
hardware opcodes in a smaller set of bits determined by the
number of opcodes in that operation group and then
decoding these bits before supplying to the functional unit.
In this case, the template and opgroup specifier bits are used
to provide the context for the decoding logic.
Miscellaneous Control Fields (C) 

Some additional control fields are present at the
instruction level that help in proper sequencing of instructions.
These consist of the consume to end-of-packet bit (Eop)
and the field that encodes the number of no-op cycles
following the current instruction.

6.7.7.2 Computing Field Conflicts

Before performing graph coloring, the system computes
the pairwise conflict relation between instruction fields,
which is represented as an undirected conflict graph.

In the if-tree, two leaf nodes (instruction fields) conflict if
and only if their least-common ancestor is an AND node.
The system computes pairwise conflict relations using a
bottom-up data flow analysis of the if-tree. The procedure in
the implementation maintains a field set, F, and a conflict
relation, R. Set F_n is the set of instruction fields in the subtree
rooted at node n. Relation R_n is the conflict relation for the
subtree rooted at node n.

The procedure processes nodes in bottom-up order as 
follows: 
Leaf Node 

At a leaf node, l, the field set is initialized to contain the
leaf node, and the conflict relation is empty.
Or-node

At an OR-node, the field set is the union of field sets for 
the node's children. Since an OR-node creates no new 
conflicts between fields, the conflict set is the union of 
conflict sets for the node's children. 
And-node 

At an AND-node, the field set is the union of field sets for
the node's children. An AND-node creates a new conflict
between any pair of fields for which this node is the
least-common ancestor; i.e. there is a new conflict between
any two fields that come from distinct subtrees of the
AND-node. Formally,

$$R_n = \bigcup_{i} R_i \;\cup\; \{(x, y) \mid x \in F_j,\ y \in F_k,\ j \neq k\}$$

where i, j, and k range over the children of node n.

This method can be implemented very efficiently, by
noting that the sets can be implemented as linked lists.
Because the field sets are guaranteed to be disjoint, each
union can be performed in constant time by simply linking
the children's lists (each union is charged to the child).
Similarly, the initial union of children's conflict sets can be
done in constant time (charged to each child). Finally,
forming the cross-product conflicts between fields of distinct
and-node children can be done in time proportional to the
number of conflicts. Since each conflict is considered only
once, the total cost is equal to the total number of conflicts,
which is at most n^2. For an if-tree with n nodes and E
conflicts, the overall complexity is O(n+E) time.
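
The bottom-up conflict computation can be sketched in a few lines. The tree encoding (a leaf is a field name, an inner node is a `(kind, children)` tuple) and the field names are assumptions for illustration. The sketch uses ordinary Python lists and sets rather than the linked lists described above, so it shows the rule for each node type but not the constant-time unions.

```python
from itertools import combinations, product


def field_conflicts(node):
    """Return (fields, conflicts) for the if-tree subtree rooted at node.

    A leaf is a field name (str); an inner node is (kind, children) with
    kind "AND" or "OR". Conflicts are unordered pairs stored as frozensets.
    """
    if isinstance(node, str):
        return [node], set()  # leaf: one field, no conflicts
    kind, children = node
    results = [field_conflicts(child) for child in children]
    # Field set and conflict set start as the unions over the children.
    fields = [f for child_fields, _ in results for f in child_fields]
    conflicts = set().union(*(child_conf for _, child_conf in results))
    if kind == "AND":
        # New conflicts: every pair of fields from distinct children.
        for (fa, _), (fb, _) in combinations(results, 2):
            conflicts |= {frozenset((x, y)) for x, y in product(fa, fb)}
    return fields, conflicts


# Hypothetical template: an OR of two opcode fields alongside two operands.
example_tree = ("AND", [("OR", ["op1", "op2"]), ("AND", ["src", "dst"])])
example_fields, example_conflicts = field_conflicts(example_tree)
```

In the example, `op1` and `op2` share an OR ancestor and therefore do not conflict, while each of them conflicts with both operand fields.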

6.7.7.3 Assigning Field Affinities 

As introduced above, the iformat system is capable of 
aligning instruction fields that correspond to the same
control port to the same bit position in a process called affinity
allocation. Such alignment may simplify the multiplexing
and decoding logic required to control the corresponding 
datapath control ports since the same instruction bits are 
used under different templates. On the other hand, such 
alignment may waste some bits in the template thereby 
increasing its width. 

In order to make use of affinity allocation, the iformat 
designer groups instruction fields that point to the same 
datapath control port into a superfield. All instruction fields 
within a superfield are guaranteed not to conflict with each 
other since they use the same hardware resource and there- 
fore must be mutually exclusive. 

The superfield partitioning only identifies instruction 
fields that should preferably share instruction bits. However, 
sometimes it is deemed essential that certain instruction 
fields must share the same bits. For example, if the address 
bits of a register read port are aligned to the same bit 
positions under all templates, then these address bits may be 
steered directly from the instruction register to the register
file without requiring any control logic to select the right set 
of bits. This forced sharing of bit positions can avoid the 
need for a multiplexor in the critical path of reading oper- 
ands out of a register file, thereby enhancing performance. 
To handle such a constraint, the iformat system allows a
user or other program module to specify a subset of fields
within a superfield that must share bits. One way to specify
this is in the form of a level mask that identifies the levels
of the if-tree below which all instruction fields that are in the
same superfield must share bit positions. This mask is a
parameter to the bit allocation process described in the next
section.

6.7.8 Resource Allocation

Once the instruction fields have been assigned to the 
leaves and the pairwise conflicts have been determined, we 
are ready to begin allocating bit positions to the instruction 
fields. In this problem, instruction fields are thought of as 
resource requesters. Bit positions in the instruction format 
are resources, which may be reused by mutually exclusive
instruction fields. Fields required concurrently in an instruc- 
tion must be allocated different bit positions, and are said to 
conflict. The resource allocation problem is to assign
resources to requestors using a minimum number of
resources, while guaranteeing that conflicting requestors are 
assigned different resources. The current implementation of 
resource allocation uses a variation of graph coloring. 

Once the if-tree and instruction field conflict graph are
built, the iformat system can allocate bit positions in the
instruction format to instruction fields. Pseudocode for the 
resource allocation is shown below: 

ResourceAlloc(nodeRequests, conflictGraph)
  // compute resource request for each node+neighbors
  foreach (node ∈ conflictGraph)
    Mark(node) = FALSE;
    TotalRequest(node) = Request(node) + Request(NeighborsOf(node));
  // sort nodes by increasing remaining total resource request
  // compute upper-bound on resources needed by allocation
  resNeeded = 0; Stack = EMPTY;
  for (k from 0 to NumNodes(conflictGraph)-1)
    find (minNode ∈ unmarked nodes) such that
      TotalRequest(minNode) is minimum;
    Mark(minNode) = TRUE;
    push(minNode, Stack);
    resNeeded = max(resNeeded, TotalRequest(minNode));
    foreach (nhbr ∈ NeighborsOf(minNode))
      TotalRequest(nhbr) -= Request(minNode);
  // process nodes in reverse order (i.e., decreasing total request)
  while (Stack is not EMPTY)
    node = pop(Stack);
    AllResources = {0 . . . resNeeded-1};
    // available bits are those not already allocated to any neighbor
    AvailableRes(node) = AllResources - AllocatedRes(NeighborsOf(node));
    // select requested number of bits from available positions
    // according to one of several heuristics
    AllocatedRes(node) = Choose Request(node) resources from
      AvailableRes(node)
        H1: Contiguous Allocation
        H2: Affinity Allocation
  return resNeeded

In the above pseudocode, the total resource request for a 
node and its neighbors is computed by the first loop. The 
heuristic repeatedly reduces the graph by eliminating the 
node with the current lowest total resource request (node 
plus remaining neighbors). At each reduction step, we keep 
track of the worst-case resource limit needed to extend the 
coloring. If the minimum total resources required exceed the 
current value of k, we increase k so that the reduction 
process can continue. The graph reduction is performed by 
the second loop. Nodes are pushed onto a stack as they are 
removed from the graph. Once the graph is reduced to a 
single node, we begin allocating bit positions (resources) to 
nodes. Nodes are processed in stack order, i.e. reverse 
reduction order. At each step, a node is popped from the 
stack and added to the current conflict graph so that it 
conflicts with any neighbor from the original graph that is 
present in the current conflict graph. The existing allocation 
is extended by assigning bit positions to satisfy the current 
node's request, using bit positions disjoint from bit positions 
assigned to the current node's neighbors. 
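
The reduction and allocation phases described above can be sketched as runnable code. This is an illustration under stated assumptions, not the patented implementation: the function name `resource_alloc`, the dictionary-based graph, the alphabetical tie-break, and the sample field widths are hypothetical, and only the left-most heuristic is applied when choosing bits.

```python
def resource_alloc(request, neighbors):
    """Assign bit positions to instruction fields.

    request:   {field: number of bits requested}
    neighbors: {field: set of conflicting fields} (symmetric)
    Returns (alloc, res_needed): alloc maps each field to its set of bit
    positions; res_needed is the upper bound found during reduction.
    """
    total = {n: request[n] + sum(request[m] for m in neighbors[n])
             for n in request}
    unmarked, stack, res_needed = set(request), [], 0
    # Reduction: repeatedly remove the node with the smallest remaining
    # total request (its own bits plus those of remaining neighbors).
    while unmarked:
        node = min(unmarked, key=lambda n: (total[n], n))
        unmarked.remove(node)
        stack.append(node)
        res_needed = max(res_needed, total[node])
        for nb in neighbors[node]:
            if nb in unmarked:
                total[nb] -= request[node]
    # Allocation: process nodes in reverse removal order and give each
    # the left-most positions not used by already-allocated neighbors.
    alloc = {}
    while stack:
        node = stack.pop()
        taken = set().union(*(alloc.get(nb, set()) for nb in neighbors[node]))
        free = [b for b in range(res_needed) if b not in taken]
        alloc[node] = set(free[:request[node]])
    return alloc, res_needed


# Hypothetical fields: "lit" conflicts only with "op", so it may reuse
# bits assigned to "src" and "dst".
example_request = {"op": 4, "src": 5, "dst": 5, "lit": 8}
example_neighbors = {
    "op": {"src", "dst", "lit"},
    "src": {"op", "dst"},
    "dst": {"op", "src"},
    "lit": {"op"},
}
example_alloc, example_bound = resource_alloc(example_request,
                                              example_neighbors)
```

Because a node is only allocated after every neighbor that outlived it in the reduction, the bound computed in the first phase always leaves enough free positions for its request.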

6.7.8.1 Allocation Heuristics 

During bit allocation, the current node's request can be 
satisfied using any bit positions disjoint from positions 
allocated to the node's neighbors in the current conflict 
graph. The current implementation applies several heuristics 
to guide the selection of bits. 
Left-most Allocation 

The number of required bit positions computed during 
graph reduction is the number needed to guarantee an 
allocation. In practice, the final allocation often uses fewer
bits. By allocating requested bits using the left-most avail- 
able positions, we can often achieve a shorter instruction 
format. 

Contiguous Allocation 

Since bit positions requested by an instruction field gen- 
erally flow to a common control point in the data path, we 
can simplify the interconnect layout by allocating requested 
bits to contiguous positions. 
Affinity Allocation

Non-conflicting instruction fields may have affinity, 
meaning there is an advantage to assigning them the same bit 
positions. For example, consider two non-conflicting fields 
that map to the same register file read address port. By 
assigning a single set of bit positions to the two fields, we 
reduce the interconnect complexity and avoid muxing at the 
read address port. As discussed earlier, each node has a set 
of affinity siblings. During allocation, we attempt to allocate 
the same bit positions to affinity siblings. This heuristic 
works as follows. When a node is first allocated, its alloca- 
tion is also tentatively assigned to the node's affinity sib- 
lings. When a tentatively allocated node is processed, we 
make the tentative allocation permanent provided it does not 
conflict with the node's neighbors' allocations. If the tenta- 
tive allocation fails, we allocate available bits to the current 
node using the previous heuristics, and we then attempt to 
re-allocate all previously allocated affinity siblings to make 
use of the current node's allocated bits. Because nodes are 
processed in decreasing order of conflict, tentative alloca- 
tions often succeed. 

A heuristics diagram for the resource allocation is as 
follows: 



    if node is tentatively allocated then
        make tentative allocation permanent, if possible
    if node is (still) not allocated then
        try to use a sibling allocation
    if node is (still) not allocated then {
        allocate either contiguously, or left-most available
        for each sibling of node {
            if sibling is allocated then
                try to use node's allocation in place of existing allocation
            else
                tentatively allocate sibling, using node's allocation
        }
    }
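
The heuristics above can be illustrated with a minimal Python sketch. It is
a simplification under stated assumptions, not the implementation described
here: the function names, the dictionary-based inputs, and the 256-bit limit
are hypothetical, bit position 0 is taken to be the left-most bit, and the
re-allocation of already-allocated affinity siblings is omitted.

```python
def allocate_bits(width, forbidden, limit=256):
    """Return `width` bit positions that avoid `forbidden`.

    Position 0 is assumed to be the left-most bit of the instruction.
    """
    # Contiguous allocation: left-most run of `width` free positions.
    for start in range(limit - width + 1):
        run = range(start, start + width)
        if not any(p in forbidden for p in run):
            return list(run)
    # Fallback: left-most free positions, even if not contiguous.
    free = [p for p in range(limit) if p not in forbidden]
    if len(free) < width:
        raise ValueError("not enough free bit positions")
    return free[:width]


def allocate_fields(requests, conflicts, siblings):
    """Assign bit positions to instruction fields.

    requests:  {field: number of bits requested}
    conflicts: {field: set of conflicting fields}
    siblings:  {field: set of affinity siblings}
    """
    alloc = {}      # permanent allocations
    tentative = {}  # allocations proposed by an affinity sibling
    # Process fields in decreasing order of conflict (stable for ties).
    order = sorted(requests, key=lambda f: len(conflicts.get(f, ())),
                   reverse=True)
    for node in order:
        # Bits already taken by this node's conflicting neighbors.
        used = set()
        for nb in conflicts.get(node, ()):
            used.update(alloc.get(nb, ()))
        proposal = tentative.pop(node, None)
        if (proposal is not None and len(proposal) == requests[node]
                and not used & set(proposal)):
            # Tentative allocation succeeds: make it permanent.
            alloc[node] = proposal
        else:
            alloc[node] = allocate_bits(requests[node], used)
        # Tentatively hand the same bits to unallocated affinity siblings.
        for sib in siblings.get(node, ()):
            if sib not in alloc and requests[sib] == requests[node]:
                tentative[sib] = alloc[node]
    return alloc
```

For example, two non-conflicting 3-bit fields that are affinity siblings
end up sharing bits 0 to 2, while a 2-bit field that conflicts with one of
them is placed in the next free contiguous positions.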

6.7.9 Template-based Assembly 

Once the complete structure of the instruction templates has been 
determined, we can proceed to assemble the code. All subsequent 
discussion is essentially to improve the quality of the templates. 
In this section, we briefly outline the process of assembly with a 
given set of templates. 

A program that has been scheduled and register-allocated consists 
of a sequence of operations, each of which has been assigned a time 
of issue. Multiple operations scheduled within the same cycle need 
to be assembled into a single instruction. Any instruction template 
that covers all the operations of an instruction may be used to 
assemble that instruction. Clearly, the shortest template is 
preferred to avoid increasing the codesize unnecessarily, since 
longer templates would have to be filled with noops in the slots for 
which there are no operations in the current instruction. 

The process of template selection for an instruction has the 
following steps. First, the specific compiler-opcode of each 
scheduled operation in the instruction is mapped back to its 
operation group. Each operation group keeps a record of the set of 
templates that it can be a part of. Finally, the intersection of all 
such sets corresponding to the operation groups present in the 
current instruction gives the set of templates that may be used to 
encode the current instruction. The shortest template from this set 
is chosen for assembly. The exact opcode and register bits are 
determined by mapping the compiler mnemonics to their machine 
encodings by consulting the IF-tree. 
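
The template selection steps can be sketched in a few lines of Python. This
is only an illustration: the function name and the three lookup tables are
hypothetical, and the sketch assumes that set membership alone decides
whether a template covers an instruction (it does not model two operations
of the same group competing for slots).

```python
def select_template(instruction_ops, opcode_to_opgroup,
                    opgroup_templates, template_width):
    """Pick the shortest template covering every operation of one instruction."""
    # Step 1: map each scheduled opcode back to its operation group.
    groups = [opcode_to_opgroup[op] for op in instruction_ops]
    # Step 2: intersect the template sets recorded for those groups.
    candidates = set(template_width)
    for group in groups:
        candidates &= opgroup_templates[group]
    if not candidates:
        raise ValueError("no template covers this instruction")
    # Step 3: choose the shortest candidate (ties broken by name).
    return min(candidates, key=lambda t: (template_width[t], t))


# Hypothetical tables for a three-template instruction format.
OPCODE_TO_OPGROUP = {"ADD": "ialu", "LOAD": "mem", "BR": "branch"}
OPGROUP_TEMPLATES = {
    "ialu": {"T1", "T2", "T3"},
    "mem": {"T1", "T2"},
    "branch": {"T1"},
}
TEMPLATE_WIDTH = {"T1": 128, "T2": 64, "T3": 32}
```

With these tables, an instruction holding only an ADD is assembled with the
32-bit template, while adding a LOAD or a branch forces a wider one.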

6.7.10 Design of Application-specific Instruction Formats 

As discussed above, the initial design produces a minimal set of 
maximally concurrent instruction templates that cover all possible 
concurrency relations implied by the ArchSpec. In practice, this 
tends to produce a few long templates since the processor designs we 
are interested in have quite a bit of expressible instruction-level 
parallelism (ILP). But not all that parallelism is used at all times 
by the scheduler. If we assemble programs using only these long 
templates, a lot of noops would have to be inserted in the low ILP 
parts of the code. 

One fix to this problem is to customize the templates to the program 
being compiled. There are several aspects to such customization: 

(1) Identify the most frequently used combinations of operations in 
the program and design shorter templates for them which allow fewer 
concurrent operations in them. An extension of this view also takes 
into account the most frequently used operation formats and creates 
new opgroups that incorporate just those. 
(2) Use variable length encoding wherever there is a need to select 
one out of many choices in the instruction format. We may use 
variable length template selection bits according to the frequency 
of use of each template. Likewise, different operation groups within 
a slot and different opcodes within an operation group may be given 
a variable length encoding according to their frequency of use. 
There is, of course, a tradeoff between the codesize reduction and 
the increase in decode complexity. 

(3) Sometimes, the decode complexity may be improved dramatically by 
doing affinity-based allocation of similar instruction fields across 
templates. This reduces the degree of multiplexing needed to route 
the same information represented at different positions in different 
templates. This amounts to reordering the positions of various 
operation groups within these templates. 

(4) The instruction fetch and decode hardware is usually designed 
with a certain quantum of instruction information in mind. A quantum 
is a unit of data (e.g., an integer multiple of bytes) used to 
specify the width of the data path in the instruction fetch and 
decode hardware. Rounding the instruction templates up to the next 
quantum usually frees up extra bit space. One or more of the above 
strategies can then take advantage of this extra bit space without 
increasing the width of the instruction. 

6.7.11 Schedule-based Template Customization 

The instruction format information is not needed until the program 
is ready to be assembled. The compiler is driven by a 
machine-description that only depends on the specified ArchSpec and 
the structure of the datapath. This implies that the exact schedule 
of the program may be used to customize the various available 
templates. To customize templates for a particular application 
program, the iformat system uses operation issue statistics from a 
scheduled version of the program to determine the frequency of use 
of the various combinations of operations. It then selects 
frequently used combinations of operations as possible candidates 
for new templates. Finally, it performs a cost/benefit analysis to 
select new "custom" templates. 

FIG. 18 is a flow diagram illustrating a process of selecting custom 
templates from operation issue statistics. The process begins by 
extracting usage statistics from a scheduled application program 
700. This is done by mapping the scheduled opcodes of an instruction 
back to their operation groups as shown in step 702. The process 
then generates a histogram of combinations of operation groups from 
the program as shown in step 704. 

A static histogram records the frequency of static occurrences of 
each combination within the program and may be used to optimize the 
static codesize. A dynamic histogram weights each operation group 
combination with its dynamic execution frequency and may be used to 
improve the instruction cache performance by giving preference to 
the most frequently executed sections of the code. One 
implementation uses the static histogram in the optimization to give 
preference to the overall static code size. In alternative 
implementations, the dynamic histogram or both the dynamic and 
static histograms may be used to optimize the dynamic code size or 
the combined dynamic/static code size, respectively. 

Based on the frequency of use data in the histogram, the 
customization process selects combinations of opgroups as potential 
candidates for templates (706) and evaluates their cost/benefit 
(708) in terms of code size/decode complexity, which is quantified 
in a cost function. The process iteratively selects a set of 
templates, evaluates their cost/benefit, and ultimately returns a 
set of custom templates that meet a predetermined optimization 
criteria (710, 712). As noted above, the criteria may include, for 
example, a minimized static or dynamic code size or a minimized code 
size and decode complexity. An example of this criteria is discussed 
below. 

In the current implementation, the problem of determining custom 
templates is formulated as follows. Let us assume that 
$T_1, \ldots, T_n$ are the instruction templates that are required 
to conform with the ArchSpec. Suppose $C_1, \ldots, C_m$ are 
distinct combinations of operation groups occurring in the program. 
Let the width of each combination be $w_i$ and its frequency of 
occurrence be $f_i$. Also, in case of unoptimized assembly, suppose 
each combination $C_i$ maps to an initial template with width 
$v_i$. Assuming that variable length encoding is not used for the 
template selection field, the initial size of the program is, 

$$W_{init} = \sum_{i=1}^{m} f_i \cdot \left( v_i + \lceil \log_2 n \rceil \right)$$

Now suppose we include $C_i$ as a custom template. This is taken to 
be in addition to the initial set of templates since those must be 
retained to cover other possible concurrency relations of the 
machine as specified in the ArchSpec. The additional template has a 
smaller width $w_i$ but it increases the size of the template 
selection field (and hence the decode logic). The other significant 
increase in decode cost is due to the fact that now the same 
operation may be represented in two different ways in the 
instruction format and hence the instruction bits from these two 
positions would have to be multiplexed based on the template 
selected. This cost may be partially or completely reduced by 
performing affinity allocation as discussed above. 

If $X_i$ represents a 1/0 variable denoting whether combination 
$C_i$ is included or not, the optimized length of the program is 
denoted by, 

$$W_{cust} = \sum_{i=1}^{m} f_i \cdot \left( X_i \cdot w_i + (1 - X_i) \cdot v_i + \left\lceil \log_2 \left( n + \sum_{j=1}^{m} X_j \right) \right\rceil \right)$$

$$\phantom{W_{cust}} = \sum_{i=1}^{m} f_i \cdot \left( v_i - X_i \cdot (v_i - w_i) + \left\lceil \log_2 \left( n + \sum_{j=1}^{m} X_j \right) \right\rceil \right)$$

It is clear that we should customize all those operation group 
combinations into additional templates that provide the largest 
weighted benefit until the cost of encoding additional templates and 
their decoding cost outweigh the total benefits. One possible 
strategy is to pick the k most beneficial combinations where k is a 
small fixed number (e.g., fewer than 16). The decode complexity 
directly impacts chip area. As the number of templates increases, 
the complexity of the decode logic tends to grow unless affinity 
constraints are used to align operation group occurrences from 
different templates to the same template slots. The chip area 
occupied by selection logic may be quantified as another component 
of the cost function. 

6.7.12 Variable Length Field Encodings 

Variable length field encoding is an important technique for 
reducing the overall instruction format bit length. The simplest use 
of variable length fields is in encoding a steering field that 
selects one of a set of exclusive fields of differing lengths. For 
example, the instruction formats have 
an opgroup steering field to select one of many opgroups available 
within a single issue slot. Suppose we have 32 opgroups available 
within a particular issue slot, and that the opgroups' encodings 
require lengths from 12 to 29 bits. With fixed-length encodings, we 
require an additional 5 bits to encode the opgroup selection, 
bringing the overall size of the issue slot to 34 bits. Using a 
variable-length encoding, we can allocate short encodings to 
opgroups having the greatest overall width, while using longer 
encodings for opgroups having smaller width. Provided there is 
enough "slack" in the shorter opgroups to accommodate longer 
encodings, the overall bit requirement can be reduced significantly. 
In our example, we may be able to achieve a 30 bit encoding for the 
issue slot. 

One approach to designing variable-length encodings uses entropy 
coding, and in particular, a variant of Huffman encoding. Entropy 
coding is a coding technique typically used for data compression 
where an input symbol of some input length in bits is converted to a 
variable length code, with potentially a different length depending 
on the frequency of occurrence of the input symbol. Entropy coding 
assigns shorter codes to symbols that occur more frequently and 
assigns longer codes to less frequent symbols such that the total 
space consumed by the coded symbols is less than that of the input 
symbols. 

Let $F$ be a set of exclusive bit fields, and let $w(i)$ denote the 
bit length of field $i \in F$. An encoding for the steering field 
for $F$ is represented as a labeled binary tree, where each element 
of $F$ is a tree leaf. The edge labels (zero or one) on the path 
from the root to a leaf $i$ denote the binary code for selecting 
$i$. A fixed-length steering code is represented by a balanced tree 
in which every leaf is at the same depth. Variable-length encodings 
are represented by asymmetric trees. 

For a tree $T$ representing a code for $F$, we define $d_T(x)$ to 
be the depth of $x$, i.e., the code length for choice $x$. The total 
cost of encoding a choice $x$ is the sum of the bit requirement for 
$x$ and the code length for $x$: 

$$cost_T(x) = d_T(x) + w(x)$$

The overall cost for encoding the set of fields $F$ together with 
its steering field is equal to the worst-case single field cost: 

$$C(T) = \max_{x \in F} cost_T(x)$$

The goal is to find a code $T$ of minimal cost. This problem is 
solved by the algorithm shown below: 

    Huffman (Set C, Weights W)
        n = |C|;
        // insert elements of C into priority queue
        for x in C do
            enqueue(x, Q);
        endfor
        for i = 1 to n-1 do
            z = new node;
            x = extract_min(Q);
            y = extract_min(Q);
            z.left = x; z.right = y;
            W(z) = max{W(x), W(y)} + 1;
            enqueue(z, Q);
        endfor
        return extract_min(Q);

6.7.13 Extracting an Abstract ISA Specification from a Concrete ISA 
Specification 

As outlined above, the iformat design process may be used to 
generate an instruction format specification from a datapath 
specification and an abstract ISA specification. In an alternative 
design scenario, the iformat design process may be used to generate 
an optimized concrete ISA specification programmatically from an 
initial concrete ISA specification and a list of frequently 
occurring combinations of operation group occurrences and ILP 
constraints. The initial concrete ISA specification includes an 
instruction format specification and a register file specification 
and mapping. The register file specification and mapping provides: 
1) the register file types; 2) the number of registers in each file; 
and 3) a correspondence between each type of operand instruction 
field in the instruction format and a register file. 

In order to optimize the instruction format in this scenario 
(starting from a concrete ISA specification), the iformat design 
process programmatically extracts an abstract ISA specification from 
the concrete ISA specification (see step 554 in FIG. 12). It then 
proceeds to generate the bit allocation problem specification, and 
allocate bit positions programmatically as explained in detail 
above. The operation group occurrences and ILP constraints (e.g., 
concurrency sets of the operation group occurrences) may be provided 
as input from the user (e.g., starting with a custom template 
specification at block 556 in FIG. 12), or may be generated 
programmatically from operation issue statistics 568 in step 569 
shown in FIG. 12 and described above. 

Given a Concrete ISA Specification, this step extracts the 
information corresponding to an Abstract ISA Specification. The 
Instruction Format, which is part of the Concrete ISA Specification, 
consists of a set of Instruction Templates [...]. From this 
information one can define the corresponding Operation Group 
Occurrences and a Concurrency Set consisting of these Operation 
Group Occurrences. Together these define the opcode repertoire, the 
Operation Groups and the ILP specification that form part of the 
Abstract ISA Specification. The Instruction Format Specification 
directly provides the I/O Format for each opcode as needed by the 
Abstract ISA Specification. The Register File Specification in the 
Concrete ISA Specification directly provides the Register File 
Specification that completes the Abstract ISA Specification. 

6.8 Overview of Control Path Design System 

The control path design system is a programmatic system that 
extracts values for control path parameters from an instruction 
format and data path specification and creates a control path 
specification in a hardware description language, such as AIR. 

FIG. 19 is a block diagram illustrating a general overview of the 
control path design system. The inputs to the control path design 
synthesizer (CP synthesizer) 800 include a data path specification 
802, an instruction format specification 804, and ICache parameters 
806. The CP synthesizer selects the hardware components for the 
control path design from a macrocell database 808 that includes 
generic macrocells for a sequencer, registers, multiplexors, wiring 
buses, etc. in AIR format. The macrocell database also includes a 
machine description of certain macrocells, referred to as mini-MDES. 
The mini-MDES of a functional unit macrocell, for example, includes 
the functional unit opcode repertoire (i.e., the opcodes executable 
by the functional unit and their binary encoding), a latency 
specification, internal resource usage, and input/output port usage. 

Implemented as a set of program routines, the CP synthesizer 
extracts parameters from the data path, the instruction format, and 
instruction cache specifications and synthesizes the control path 
including the IUdatapath, control logic for controlling the 
IUdatapath, and decode logic for decoding the instructions in the 
instruction register. 

The CP synthesizer builds the IUdatapath based on the instruction 
width requirements extracted from the instruction format 
specification. It instantiates macrocells in the IUdatapath by 
computing their parameters from the maximum and minimum instruction 
sizes and the instruction cache access time. 

It then constructs the control logic for controlling the IUdatapath 
based on the computed IUdatapath parameters and the ICache 
parameters. The ICache parameters provide basic information about 
the instruction cache needed to construct the instruction fetch 
logic. These parameters include the cache access time and the width 
of the instruction packet, which is the unit of cache access. 

The control path design process synthesizes the decode logic for 
decoding the instruction in the instruction register by scanning the 
instruction format and data path control ports. It also determines 
the interconnect between the bit positions in the instruction 
register and the control ports in the data path. 

The CP synthesizer is programmed to optimize the design of the 
instruction unit for a pre-determined control path protocol. As part 
of this process, it may optimize the instruction pipeline (the 
IUdatapath) by selecting macrocells that achieve a desired 
instruction issue rate, such as one instruction to the decode logic 
per cycle, and by minimizing the area occupied by the macrocells. It 
also minimizes the area of the control logic, such as the area that 
the IU control logic and decode logic occupies. 

The output of the control path design process is a data structure 
that specifies the control path hardware design in the AIR format 
810. The AIR representation of the IUdatapath includes the 
macrocells for each of the components in the IUdatapath. This may 
include, for example, a prefetch buffer for covering the latency of 
sequential instruction fetching, and other registers used to store 
instructions before issuing them to the decode logic. The AIR 
representation includes a macrocell representing the sequencer and 
the control logic specification (e.g., a synthesizable behavioral 
description, control logic tables, etc.) representing the control 
logic for each of the components in the IUdatapath. Finally, the AIR 
representation includes a decode logic specification (e.g., decode 
logic tables) representing the instruction decode logic and the 
interconnection of this decode logic between the instruction 
register and the control ports enumerated in the data path 
specification. Conventional synthesis tools may be used to generate 
the physical logic (such as a PLA, ROM or discrete logic gates) from 
the control and decode logic specifications. 

6.8.1 The Relationship Between the Control Path and the Control 
Ports in the Data Path 

Before describing aspects of the control path in more detail, it is 
instructive to consider the state of the processor design before the 
CP synthesizer is executed. As noted above, one input of the control 
path design process is the data path specification. Provided in the 
AIR format, the data path input 802 specifies instances of the 
functional unit macrocells and register file macrocells in the data 
path. It also specifies instances of the macrocells representing the 
wiring that interconnects the read/write data ports of the register 
files with input and output data ports of the functional units. At 
this phase in the design of the processor, the control ports in the 
data path are enumerated, but are not connected to other components. 
For example, the opcode input of the functional units and the 
address inputs of the register files are enumerated, but are not 
connected to the control path hardware. 

FIG. 20 illustrates an example of a processor design, showing the 
relationship between the data path (in dashed box 820) and the 
control path. The data path includes a register file instance, gpr, 
a functional unit (FU) cell instance, and an interconnect between 
the gpr and functional unit. The interconnect comprises data buses 
822-830 that carry data between the FU and gpr, a multiplexor 832 
that selects between input sources (e.g., gpr and literal 
pseudo-register Sext), and tri-state buffer 834 that drives output 
data from the FU onto a data bus 830. The data read ports of the 
gpr, dr0 and dr1, provide data to the data input ports of the FU, i0 
and i1, via buses 822-828 and multiplexor 832. The output port of 
the FU, o0, provides data to the data write port, dw0, via tri-state 
buffer 834 and data bus 830. 

The control ports that are enumerated, yet remain unconnected before 
the control path design, include the read and write address ports of 
the gpr, ar0, ar1 and aw0, and the opcode input port, op, of the FU. 
Some data ports in a FU or gpr may map to more than one data port in 
the gpr or FU, respectively. This sharing may be controlled via 
control ports of a multiplexor 832 or tri-state buffer 834. A 
control port of the gpr or FU may also map to more than one position 
in the instruction. This type of sharing may be controlled via 
control ports of a multiplexor 836, for example. However, the 
hardware logic to control this sharing is left to be specified in 
the control path design process. 

The mapping between the instruction fields in an instruction and the 
control ports in the data path is specified in the instruction 
format specification. The datapath specification enumerates the 
control ports in the data path and provides the information needed 
to map these control ports to the instruction fields. The 
instruction format specification specifies the specific bit 
positions and encodings of the instruction fields. 

The following sections describe in more detail how an implementation 
of the control path design process generates the control path. 

6.8.2 The Control Path Protocol 

The control path design process synthesizes a specific control path 
design based on a predefined control path protocol. In the current 
implementation, the control path protocol defines a method for 
fetching instructions from an instruction cache and dispatching them 
sequentially to an instruction register that interfaces with the 
processor's decode logic. It also defines the type of macrocells 
that the control path will be constructed from and enumerates their 
parameters. The CP synthesizer program then selects the macrocells 
and computes specific values for their parameters based on 
information extracted from the instruction format and datapath. 

The example in FIG. 20 helps to illustrate the control path protocol 
used in the current implementation. It is important to note that a 
number of design choices are made in defining the protocol, and 
these design choices will vary with the implementation. The 
illustrated protocol represents only one possible example. 

To get a general understanding of the control path protocol, 
consider the flow of an instruction through the control path in 
FIG. 20. The sequencer 900 initiates the fetching of instructions 
into the IUdatapath. The MAR 902 
in the sequencer stores the address of the next instruction to be 
fetched from the instruction cache 904. Using the contents of the 
MAR, the sequencer initiates the fetching of instructions from the 
cache for both a sequential mode and a branch mode. 

In order to specify values for the widths of components in the 
IUdatapath, the CP synthesizer extracts information about the 
instruction widths from the instruction format specification. The 
protocol specifies the types of parameters that need to be extracted 
from this information. 

The parameters extracted from the instruction format include: 

    Q_I     // quantum (bytes) (greatest common divisor of all
            // possible instruction widths, fetch widths)
    W_Imin  // minimum instruction width (quanta)
    W_Imax  // maximum instruction width (quanta)

The parameter, Q_I, is a unit of data used to express the size of 
instruction and fetch widths in an integer multiple of bytes and is 
referred to as a quantum. This parameter is not critical to the 
invention, but it does tend to simplify the design of other 
components such as the alignment network because it is easier to 
control shifting in units of quanta rather than individual bits. The 
parameters to be extracted also include W_Imin, the minimum 
instruction width in quanta, and W_Imax, the maximum instruction 
width in quanta. 

The protocol also defines parameters relating to the instruction 
cache (ICache) as follows: 

    W_A  // instruction packet width (quanta) (W_A >= W_Imax, W_A = 2^m)
    W_L  // cache line size (quanta) (W_L >= W_A, W_L = 2^n)
    T_A  // cache access time (cycles)

The instruction packet defines the amount of data that the control 
path fetches from the ICache with each fetch operation. In the 
protocol of the current implementation, the size of the instruction 
packet is defined to be at least as large as the widest instruction 
and is expressed as a number of quanta that must be a power of two. 
However, the packet need not be that large if the widest instruction 
is infrequent. In instruction format designs where the widest 
instruction is infrequent, the size of the control path can be 
reduced because the extra cycles needed to fetch instructions larger 
than the packet size will rarely be incurred. The computation of the 
packet size can be optimized by finding the smallest packet size 
that will provide a desired fetch performance for a particular 
application or a set of application programs. 

The protocol specifies the method for fetching instructions from the 
ICache and the types of components in the IUdatapath. In the current 
implementation, the protocol includes a prefetch packet buffer, an 
On Deck Register (OnDeckReg or ODR) and an instruction register 
(IR). As shown in FIG. 20, the sequencer 900 is connected to the 
instruction cache 904 via control lines 906. These control lines 
include ICache address lines used to specify the next instruction to 
be fetched into the IUdatapath. Through these control lines, the 
sequencer 900 selects the packet and initiates the transfer of each 
packet of instructions from the instruction cache to a First-In, 
First-Out (FIFO) buffer 908. 

The cache access time T_A is an ICache parameter provided as input 
to the control path design process. It is the time taken in cycles 
between the point when an address is presented to the address port 
of the ICache and when the corresponding data is available on its 
data port for reading. The cache line size parameter defines the 
width of a cache line in quanta. The control path design process 
selects a cache line size that is at least as large as the 
instruction packet width and is a power of two. Although not 
necessary, this implies that in our current implementation a cache 
line contains an integral number of instruction packets. 

The IUdatapath begins at the ICache and flows into the FIFO 908 via 
data lines 910. The number of data lines is defined as the 
instruction packet size in quanta. The FIFO 908 temporarily stores 
packets of instructions on their way to the instruction register 
912. The objective in designing the FIFO is to make it deep enough 
to cover the latency of sequential instruction fetching from the 
instruction cache. 

The control path must be able to issue instructions to the 
instruction register to satisfy a desired performance criterion. In 
this case, the protocol defines the performance criterion as a rate 
of one instruction issue per clock cycle of the processor. Note, one 
instruction may contain several operations that are issued 
concurrently. 

The IU Control 903 is responsible for controlling the flow of 
instruction packets from the FIFO 908 to a register that holds the 
next packet of instructions to be issued to the instruction 
register, called the ODR 914. In the example shown in FIG. 20, the 
IU Control 903 controls the flow of instruction packets from the 
FIFO to the ODR 914 through control lines 916 to the FIFO 908, and 
control lines 918 to a multiplexor 920. The control lines 916 from 
the IU Control to the FIFO are used to accept new instruction 
packets from the ICache and to instruct the FIFO to transfer the 
next instruction packet to the ODR via data lines 922 from the FIFO 
to the multiplexor 920 and data lines 924 from the multiplexor to 
the ODR. As explained above, the size of this data path is defined 
via the instruction packet size parameter. 

The IU Control 903 directs the multiplexor 920 to select an 
instruction packet either from the FIFO 908 or directly from the 
instruction cache 904. The data path 926 is useful in cases where 
the FIFO has been cleared, such as when the processor has executed a 
branch instruction and needs to load the instruction packet 
containing the target of the branch into the ODR as quickly as 
possible. 

The size of the FIFO (in packets) is another parameter in the 
control path protocol. The size of the FIFO depends upon the maximum 
and minimum instruction widths of instructions in the instruction 
format as well as the ICache access time. The width of an 
instruction may be as large as the maximum instruction width, and 
may be as small as the minimum instruction width in the instruction 
format specification. This constraint is merely a design choice in 
the current implementation, and is not necessary. The minimum 
instruction width plays an important role in determining the size of 
the FIFO because, in an extreme case, the ODR may be filled entirely 
with instructions of minimum size. In this case, the FIFO needs to 
be large enough to be filled with instruction packets already in 
flight from the ICache as each of the instructions is issued 
sequentially from the ODR. The maximum instruction width also has an 
impact on the size of the FIFO because, in the opposite extreme, the 
ODR may contain a single instruction. In this case, the FIFO must be 
able to supply an instruction packet to the ODR at the desired 
performance rate, namely, once per clock cycle, while hiding the 
ICache access latency. 

The parameters associated with the instruction fetch process include 
the size of the FIFO and the branch latency. 
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These parameters are computed as shown below. The nec- 
essary FIFO size can be computed based on [Udatapath 
parameters and the instruction fetch policy. In case the 
policy docs not allow for stalling the processor due to 
interrupts, then the FIFO size can be reduced further. 



$N_{FIFO}$ // size of prefetch FIFO (packets) ($N_{FIFO} = \lceil T_A \cdot W_{Imax} / W_A \rceil$)
$T_B$ // branch latency ($T_B = T_{dpath} + T_A + 1$)
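
The two fetch parameters can be evaluated with a few lines of arithmetic. The following is a minimal sketch, assuming the FIFO size reads as ceil(T_A * W_Imax / W_A) (the same expression used for the inventory size in the IU Control pseudocode) and the branch latency reads as T_dpath + T_A + 1. The function and argument names are hypothetical and chosen only for illustration; `t_dpath` stands for the datapath term in the branch latency expression.

```python
def fifo_size_packets(t_a: int, w_imax: int, w_a: int) -> int:
    """Prefetch FIFO size in packets: ceil(T_A * W_Imax / W_A).

    t_a    -- ICache access time in cycles (T_A)
    w_imax -- maximum instruction width in quanta (W_Imax)
    w_a    -- packet width in quanta (W_A)
    """
    if w_a <= 0:
        raise ValueError("packet width must be positive")
    # Integer ceiling division avoids floating-point rounding.
    return -(-(t_a * w_imax) // w_a)


def branch_latency(t_dpath: int, t_a: int) -> int:
    """Branch latency in cycles: T_dpath + T_A + 1 (assumed reading)."""
    return t_dpath + t_a + 1
```

For example, with a three-cycle cache, a maximum instruction width of 6 quanta and 8-quantum packets, `fifo_size_packets(3, 6, 8)` evaluates to 3 packets.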



The IU Control 903 controls the transfer of each instruction
from the ODR 914 to the instruction register 912. The
IU Control provides control signals via control lines 927 to
the ODR, which in turn transfers the next instruction to the
instruction register 912 via data lines 928 and an alignment
network 930. The alignment network is responsible for
ensuring that each instruction is left aligned in the instruction
register 912. In the example shown in FIG. 20, the
alignment network comprises a multiplexor for each
quantum in the instruction register. Each of these multiplexors
indicates where the next quantum of data will originate
from in the ODR 914 or the IR 912. The IU Control 903
provides multiplexor select controls via control lines 932
based on parameters fed back from the decode logic via
control lines 934.

The control path protocol outlines the operation of the
alignment network. There are two principal modes of operation
that the protocol of the alignment network must address:
sequential instruction fetch mode; and branch target instruction
fetch mode. FIG. 21 illustrates the operation of the shift
network protocol for sequential instruction fetching, and
FIG. 22 illustrates the operation of the shift network for
branch target instruction fetching. Before describing the
operation of the shift network in more detail, we begin by
describing the relevant parameters associated with the shift
network. The parameters in the current implementation are
as follows:



$W_{IR}$ // width of instruction register (quanta) ($W_{IR} = W_{Imax}$)
$W_{curr}$ // width of current instruction (quanta)
$W_{consumed}$ // width of already used part in ODR (quanta)
$P_{target}$ // position of branch target in ODR (quanta)



As noted previously, the shift network controls where 
each bit of data in the instruction register comes from. This 
data may come from the IR, the ODR, or in some cases, from 
both the ODR and the top instruction packet in the FIFO. 
With each cycle, the shift network ensures that the next 
instruction to be executed is left aligned in the instruction 
register. In doing so, it may shift unused bits within the 
instruction register itself, it may transfer bits from the ODR, 
and finally it may also transfer bits from the top of the FIFO. 
In particular, if the instruction register contains unused bits 
from the previous cycle representing part of the next 
instruction, it shifts these unused bits over to the left, and 
then fills in the rest of the instruction register with the next 
group of bits sufficient to fully load the register. 

As noted above, the FIFO transfers instructions to the 
OnDeck register in packets. A packet remains in the ODR, 
and is incrementally consumed as the alignment network 
transfers portions of the bits in the ODR into the instruction 
register. The IU Control supplies control signals via control
lines 936 to the instruction register 912 to issue the current
instruction to the decode logic. The PC 938 in the sequencer
specifies the memory address of the instruction currently
being issued for execution.

6.8.2.1 The Alignment Network Protocol 

FIG. 21 illustrates the two principal cases that occur in the
shift network protocol for sequential instruction fetching.
The first case is where the width of the current instruction in
the instruction register, $W_{curr}$, is less than the remaining,
unconsumed portion of the ODR, $W_A - W_{consumed}$. FIG. 21
illustrates an example of this scenario by showing the 
transition of the state of the instruction register, ODR, and 
FIFO from one cycle to the next. In the first cycle 1000, the 
current instruction occupies the left-most section (see sec- 
tion 1002) of the instruction register, while a part of the next 
instruction occupies the remaining section 1004. Also, a 
portion 1006 of the ODR is already consumed, and the 
remaining section 1008 contains valid data. In this case, the 
shift network shifts the unused portion 1004 to the left of the 
instruction register (see section 1010 representing the trans- 
fer of the bits from the right of the instruction register to the 
left-most position). In addition, the shift network transfers 
enough bits to fill in the remainder of the instruction register 
(see section 1012) from the left-most valid data portion 1008 
in the ODR. 

In the next cycle 1014, the instruction register contains the 
current instruction, aligned to the left, and a portion of the 
next instruction. The length of the current instruction 
becomes known only after decoding. The ODR contains a 
consumed portion 1016, which includes portions that the 
shift network already transferred in previous cycles. It also 
contains a remaining valid data portion 1018. The FIFO 
remains unchanged in this case. 

The bottom diagrams 1030, 1032 in FIG. 21 illustrate the
case where the width of the current instruction is greater than
the valid data portion ($W_A - W_{consumed}$). In this case, the
current instruction occupies a relatively large section 1034
of the instruction register and the remaining portion 1036
contains part of the next instruction. The consumed portion
1038 of the ODR is relatively large compared to the remaining
valid data portion 1040. As a result, the shift register
needs to transfer data from three sources: the unused portion
1036 of the instruction register (shown being transferred in
graphic 1042), the entire valid data portion remaining in the
ODR 1040 (shown being transferred in graphic 1044), and
finally, a portion in the top packet of the FIFO that is needed
to fill in the rest of the instruction register (shown being
transferred in graphic 1046). Since the ODR is fully
consumed, the top packet of the FIFO needs to be advanced
to the ODR. However, this example shows that a portion of
the packet in the top of the FIFO is already consumed when
the packet is transferred into the ODR (see section 1048
being transferred into the ODR), which leaves a consumed
portion 1050 in the OnDeck register.
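
The two sequential-fetch cases can be modeled with a short simulation. The sketch below is only an illustration of the behavior described for FIG. 21, not the synthesized hardware: it represents the IR, the ODR and each FIFO packet as Python lists of quanta, and the function name and argument layout are hypothetical. It assumes that a packet is available in the FIFO whenever the ODR runs out of valid data.

```python
from collections import deque


def sequential_shift(ir, odr, w_consumed, fifo, w_curr):
    """Model one sequential-fetch cycle of the alignment network.

    ir         -- list of quanta in the instruction register (length W_IR)
    odr        -- list of quanta in the On Deck Register (length W_A)
    w_consumed -- number of ODR quanta already used
    fifo       -- deque of packets, each a list of W_A quanta
    w_curr     -- width of the instruction being issued, in quanta
    Returns (new_ir, new_odr, new_w_consumed).
    """
    w_ir, w_a = len(ir), len(odr)
    new_ir = ir[w_curr:]                  # unused IR quanta shift to the left
    need = w_ir - len(new_ir)             # quanta still required to refill the IR
    take = min(need, w_a - w_consumed)    # valid quanta remaining in the ODR
    new_ir += odr[w_consumed:w_consumed + take]
    w_consumed += take
    need -= take
    if need > 0:                          # ODR exhausted: use the top FIFO packet
        odr = fifo.popleft()              # the packet advances into the ODR
        new_ir += odr[:need]
        w_consumed = need                 # its leading quanta are already consumed
    return new_ir, odr, w_consumed
```

In the first case only the IR and ODR contribute and the FIFO is untouched. In the second case the top FIFO packet moves into the ODR with its leading quanta already marked as consumed, which mirrors consumed portion 1050.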

FIG. 22 illustrates the two principal cases that occur in the
shift network protocol for branch target instruction fetching.
When the processor executes a branch instruction, the control
path should load the instruction containing the target of
the branch as quickly as possible. There are a variety of
schemes to accomplish this objective. Even within the
specific protocol described and illustrated thus far, there are
alternative ways to define the target fetch operation. In the
example shown in FIG. 22, the target of a branch is allowed
to reside anywhere in an instruction packet. This may result
in the case where the next portion of valid data to be loaded
into the instruction register (the target data) spans two
instruction packets. One way to avoid this case is to require
the application program compiler to align branch targets at
the beginning of instruction packets. However, the example
shown in FIG. 22 is more general and handles the case where
the target data spans instruction packets.

The top diagrams 1100, 1102 illustrate the case where the
target data is entirely within an instruction packet. This case
is defined as a packet where the width of the instruction
register, $W_{IR}$, is less than or equal to the width of a packet,
$W_A$, less the position of the target instruction relative to the
start of the packet, $P_{target}$. In the first cycle 1100 the current
instruction occupies the left-most portion 1104 of the
instruction register. In the shift operation, the entire contents
of the instruction register are considered invalid. As such,
the shift network fills the instruction register with new bits
sufficient to fill it entirely (as shown in graphic 1106). The
starting bit in the ODR for this shift operation is identified
by $P_{target}$ (see invalid portion 1108 in the ODR, which has
a width $P_{target}$). Since the width of the instruction register
plus $P_{target}$ is still less than or equal to $W_A$, all of the new
data comes from the ODR. After the shift, the consumed
portion of the ODR occupies the left-most portion 1110 and
some valid data for the next instruction may reside in the
remaining portion 1112.

The bottom two diagrams 1120, 1122 show the case where
the target data spans an instruction packet. This case is
defined as a packet where the width of the instruction
register, $W_{IR}$, is greater than the width of a packet, $W_A$, less
the width of the offset to the target instruction inside the
packet, $P_{target}$. In the first diagram 1120, the current instruction
occupies the left-most portion 1124 of the instruction
register. In the shift operation, the entire contents of the
instruction register are considered invalid. As such, the shift
network fills the instruction register with new bits sufficient
to fill it entirely, but to do so, it must take bits from the ODR
and the next packet from the ICache (as shown in graphics
1126 and 1128). The starting bit in the ODR for this shift
operation is identified by $P_{target}$ (see invalid portion 1130 in
the ODR, which has a width $P_{target}$). Since the width of the
instruction register plus $P_{target}$ is greater than $W_A$, some of
the new data comes from the ODR and some comes from the
next packet from the ICache. To get the target data into the
instruction register, the control path may require two cycles.
The shift network transfers valid bits from the ODR (as
identified by $P_{target}$) to the IR and transfers the next packet
(1132) from the ICache into the ODR. It then transfers valid
bits from the ODR (1128) sufficient to fill the IR. This leaves
a portion of the bits in the ODR 1134 ($W_{IR} - (W_A - P_{target})$)
invalid.

The shift network protocol outlined above specifies how
the IU Control logic controls the select ports of the multiplexors
in the shift network in order to make the selection of
the appropriate quanta in the IR, ODR, and FIFO. Further
details about the synthesis of the shift network are provided
below.

The final aspect of the control path protocol is the decode
logic. Referring again to the example in FIG. 20, the decode
logic (e.g., decode units 940-944) interfaces with the
instruction register, decodes the current instruction, and
dispatches control signals to the control ports in the data
path. The CP synthesizer computes decode tables from the
instruction format design as explained below.

6.8.3 Control Path Design

FIG. 23 is a flow diagram illustrating the operation of a
software implementation of the CP synthesizer illustrated in
FIG. 19. The CP synthesizer is implemented in the C++
programming language. While the software may be ported to
a variety of computer architectures, the current implementation
executes on a PA-RISC workstation or server running
under the HP-UX 10.20 operating system. The functions of
the CP synthesizer software illustrated in FIG. 23 are
described in more detail below.

6.8.3.1 Collecting Parameter Values

The CP synthesizer begins by collecting and adjusting
input parameters, including the quantum size, $W_{Imin}$, $W_{Imax}$,
$T_A$, and $W_A$, as shown in step 1200. It calculates the quantum
as the greatest common denominator of all possible
instruction widths and fetch widths. It extracts $W_{Imax}$ and
$W_{Imin}$ from the instruction format, and derives and possibly
adjusts $W_A$ as defined above. [...] parameters to the control
path design.

The CP synthesizer computes #$W_{curr}$bits, a parameter that
defines the number of bits needed to represent the length of
the current instruction in quanta. The length of the current
instruction may be zero or as large as $W_{Imax}$. Therefore,
#$W_{curr}$bits is computed as $\lceil \log_2(W_{Imax} + 1) \rceil$. The IU Control
receives $W_{curr}$ from the decode logic (see lines 934 in FIG.
20) and uses it to compute the appropriate shift amount for
the shift and align network. The sequencer also uses this
value to update the PC with the address of the next
instruction to execute. The CP synthesizer determines the
number of instruction register multiplexor selection bits,
#IRmuxbits, as shown in step 1200, from the following
expression: #IRmuxbits $= \lceil \log_2(W_A + W_{Imax} - W_{Imin}) \rceil$ in
bits. This is the number of bits needed to select between
$(W_A + W_{Imax} - W_{Imin})$ input quanta choices for each quantum
multiplexor placed before the instruction register.

6.8.3.2 Allocating the Instruction Register and Sequencer

Next, the CP synthesizer selects an instruction register
from the macrocell database as shown in step 1202, and sets
the width of the instruction register equal to $W_{Imax}$.

The CP synthesizer also selects a sequencer from the
macrocell database in step 1204. The sequencer includes
logic to process the branch addressing, logic to handle
interrupts and exceptions and logic to issue instruction
fetching from the ICache. The choice of the sequencer
depends on the architectural requirements specified during
the design of the datapath and the instruction format, i.e.,
whether the processor needs to handle interrupts and
exceptions, branch prediction, and control and data speculation.
It is independent of the design of the instruction unit
data path itself. Therefore, we assume that we have a set of
predesigned sequencer macrocells available in the macrocell
database from which one is selected that matches the
architectural parameters of the datapath and the instruction
format.

6.8.3.3 Building the Instruction Decode Logic

The CP synthesizer generates decode logic from the
instruction format specification, which is provided in the IF
tree 1206. This section describes how the CP synthesizer
generates the decode tables programmatically.

The CP synthesizer generates the decode logic by creating
decode tables that specify the inputs and outputs of the
decode logic. In building a decode table, the CP synthesizer
specifies the input bit positions in the instruction register, the
input values for these bit positions, the corresponding control
ports, and finally, the output values to be provided at
these control ports in response to the input values. There are
two general cases: 1) creating decode table entries for select
fields (e.g., bits that control multiplexors and tri-state
drivers); and 2) creating decode table entries for logic that
converts opcodes. In the first case, the CP synthesizer
generates the address selection logic needed to map bit
positions in the instruction register with shared address
control ports in the data path. It also generates the appropriate
select values based on the select field encoding in the
instruction template. In the second case, the CP synthesizer
generates the opcode input values needed to select a particular
opcode in a functional unit based on the opcode field
encoding in the instruction template. Both of these cases are
described further below.

The implementation divides the decode logic into two
types of components: the template decode logic (synthesized
in step 1208) and the FU decode logic, one per FU macrocell
(synthesized in step 1210). The template decode logic is
responsible for decoding all the information that is relevant
for the entire instruction including the template width, the
end-of-packet bit and the position of register file address
port bits. The FU decode logic decodes all the information
that is relevant for one FU macrocell including its opcode
and the select ports of the data multiplexors and tri-state
drivers. In step 1208, the CP synthesizer constructs a decode
table for a template decode programmable logic array
(PLA). As shown in the example of FIG. 20, the template
decode PLA provides information ($W_{curr}$ and EoP parameter
values) to the IU Control to drive the instruction shifting
network. It converts the template ID into $W_{curr}$ and feeds
this information to the IU Control. It also provides the
consume to end-of-packet (EoP) bit to the IU Control.

Based on the template ID, the template decoder also
generates the mux select inputs in cases where instruction
fields from different templates map to the same control ports
in the datapath. For example, it computes select values for
the mux select ports of register file address port multiplexors
(RF port addrmux_sel; see, e.g., multiplexor 836 in FIG. 20).

To illustrate decode logic generation for select fields, 
consider the example of the RF address port multiplexors. 
The CP synthesizer builds a decode table for the address port 
multiplexors by traversing the IF tree to find the template 
specifier fields. The template specifier in the instruction 
identifies the template to the decode logic. This is significant 
because a number of different bit positions may map to the 
same register file address port depending on the instruction 
template. Table 1 shows an example of this scenario.



TABLE 1

Template    Bit Positions    Mux Inputs    Mux select
T1          0-3              I1            00
T2          10-13            I2            01
T3          1-3, 10          I3            10
T4          10-13            I4            11

In the example shown above, four different sets of bit 
positions map to the same register file address ports, depend- 
ing on the instruction template. The decode logic, therefore, 
needs to generate the appropriate mux select signal to map 
the appropriate bit positions in the instruction to the register 
file address ports depending on the template specifier bits. 

For each template, the CP synthesizer traverses the IF tree 
to the template specifier field and adds the bit encoding to 
the decode table as an input. It finds the corresponding bit 
positions from different templates that map to the same
register file address ports and assigns them to the input ports
of a multiplexor. Finally, it assigns mux select values so that 
the decode logic instructs the mux to select the appropriate 
mux inputs depending on the template specifier. 
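
The select-field procedure described above can be sketched as a small table builder. This is an illustrative sketch only, using the rows of Table 1 as input: it keys the table by template name rather than by the template specifier's bit encoding, assigns mux inputs in list order, and uses hypothetical names (`build_mux_select_table`, `TABLE_1`) that do not come from the CP synthesizer itself.

```python
def build_mux_select_table(template_fields):
    """Build decode-table rows for one shared register file address port.

    template_fields -- list of (template_id, bit_positions) pairs, given in the
                       order in which they are wired to the multiplexor inputs.
    Returns a dict mapping template_id to its mux input and select code.
    """
    n = len(template_fields)
    # Number of select bits needed to address n mux inputs (at least one bit).
    sel_bits = max(1, (n - 1).bit_length())
    table = {}
    for index, (template_id, bit_positions) in enumerate(template_fields):
        table[template_id] = {
            "bit_positions": list(bit_positions),
            "mux_input": f"I{index + 1}",
            "mux_select": format(index, f"0{sel_bits}b"),
        }
    return table


# Rows in the style of Table 1 (template names T1 to T4 are assumed).
TABLE_1 = [
    ("T1", [0, 1, 2, 3]),
    ("T2", [10, 11, 12, 13]),
    ("T3", [1, 2, 3, 10]),
    ("T4", [10, 11, 12, 13]),
]
```

Calling `build_mux_select_table(TABLE_1)` yields one row per template, with select codes assigned in input order, matching the select column of Table 1.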

To illustrate decode logic generation for opcode fields,
consider an example where the number of bits used to encode
the opcode field in the instruction does not match the number
of bits used to encode the opcode on the functional unit
macrocell. The CP synthesizer constructs the FU decode PLA
in step 1210 in a similar fashion as the
template decode PLA. In particular, it builds a decode table
that maps instruction register bits to data path control ports
of the functional units in the data path. It traverses the IF tree
to find the FU opcode fields. The CP synthesizer
finds the instruction register ports that these fields have
been assigned, and maps them to the opcode control ports.

The opcode field in the IF tree identifies the desired
operations in an operation group and the corresponding
functional unit to the decode logic. The opcode in the
instruction field may need to be translated into a different
form so that it selects the proper operation in the functional
unit. Table 2 shows an example of this scenario.

TABLE 2

Opcode encoding    FU input
00                 0000
01                 1011
10                 1100
11                 0010



In the above example, the instruction selects one of four
different operations to be executed on a given functional unit
in the data path. The functional unit, however, supports more
operations, and thus, uses a four bit input code to select an
operation. In this case, the CP synthesizer generates a
decode table for decode logic that will select the proper
operation based on the opcode encoding in the instruction
register. To accomplish this, it traverses the IF tree to find the
opcode field, and the corresponding bit encoding, control
port assignment, and bit position for this field. The opcode
field in the IF tree is annotated with information that maps
a bit encoding in the instruction to a particular input encoding
for a functional unit in the data path. The CP synthesizer
assigns the inputs of the decode logic to the bit positions of
the opcode field, and assigns the outputs of the decode logic
to the opcode control ports of the functional unit.
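
The opcode translation can be pictured as a small lookup table tied to bit positions and a control port. The sketch below is a hedged illustration using the encodings of Table 2; the bit positions, the port name `fu0.opcode`, and both function names are hypothetical and chosen only to make the example self-contained.

```python
# Opcode encoding in the instruction -> input code expected by the FU (Table 2).
OPCODE_MAP = {"00": "0000", "01": "1011", "10": "1100", "11": "0010"}


def fu_decode_table(opcode_bit_positions, control_port, opcode_map):
    """Create decode-table rows: input bits, input value, port, output value."""
    return [
        {
            "input_bits": tuple(opcode_bit_positions),
            "input_value": encoding,
            "control_port": control_port,
            "output_value": fu_code,
        }
        for encoding, fu_code in sorted(opcode_map.items())
    ]


def decode(rows, instruction_bits):
    """Return (control_port, output_value) for the row matching the instruction.

    instruction_bits -- string of '0'/'1' characters indexed by bit position.
    Returns None when no row matches.
    """
    for row in rows:
        value = "".join(instruction_bits[p] for p in row["input_bits"])
        if value == row["input_value"]:
            return row["control_port"], row["output_value"]
    return None
```

With the opcode field placed at bit positions 4 and 5, `decode(fu_decode_table([4, 5], "fu0.opcode", OPCODE_MAP), "0000010000")` reads the encoding `01` and returns the four-bit FU input `1011`. A logic synthesizer would turn such rows into PLA or gate-level logic; the lookup here only models the input-to-output relation.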

The FU decode logic for the control ports of the muxes
and tri-states in the interconnect between the functional units
and register files is generated based on the select fields at the
IO set level in the IF tree in a similar fashion as described
above for the RF address MUXes.

Once the decode logic tables are created, a variety of
conventional logic synthesizer tools may be used to create
hardware specific decode logic from the decode tables, which
is not necessarily restricted to a PLA-based design.

6.8.3.4 Assembling the Instruction Unit 

In step 1212, the CP synthesizer builds the remainder of
the instruction unit, including the IUdatapath and the control
logic between the IUdatapath and sequencer. In this step, the
CP synthesizer allocates the FIFO, ODR, and alignment
network by selecting AIR macrocells from the macrocell
database and instantiating them. It maps the control ports of
these components in the IUdatapath to the control outputs of
the IU Control logic. The IU Control logic controls the
behavior of the IUdatapath at each cycle by providing
specific bit values for each of the control ports of the
IUdatapath components. The logic may be specified as a
behavioral description of a finite state machine (FSM). From
this description, conventional logic synthesis may be used to
generate the FSM logic that forms the IUdatapath control
logic.

When it allocates the sequencer macrocell, the CP synthesizer
allocates the sequencer ports responsible for ICache
control and addressing and connects them to the corresponding
ICache ports (see, e.g., 906, FIG. 20). The number of address
lines depends on the number of ICache address
bits. The memory address register (MAR) 902 drives the
address port of the ICache while a fetch request bit (FReq)
generated by the IU Control logic controls when new
instruction packet fetches are initiated.

The CP synthesizer allocates the FIFO (908, FIG. 20) by
computing the size of the FIFO as described above and
constructing a macrocell instance from the macrocell database
with $N_{FIFO}$ packet registers of width $W_A$ and a number
of control and data ports. The data output of the ICache is
connected to the data input of the FIFO. The various FIFO
control ports are driven by the corresponding ports of the IU
Control logic (916, FIG. 20).

The CP synthesizer also allocates the ODR (914, FIG. 20)
by constructing a macrocell instance of a register having a
width $W_A$ and having corresponding control and data ports.
It synthesizes the ODR's input side multiplexor (920, FIG.
20) by constructing a multiplexor from the macrocell database
having a width $W_A$. The two inputs of the multiplexor
920 are connected to the FIFO and the ICache respectively.
The selection control and the ODR load control ports are
driven by the corresponding ports from the IU Control logic
(918, 926, FIG. 20).

The CP synthesizer additionally synthesizes the branch
FU control and address lines to interconnect the branch
control ports of the sequencer with control ports of the
branch FU.

It further allocates the instruction register shift network
(930, FIG. 20), and connects its control ports to the IU
Control logic (932, FIG. 20). FIG. 24 illustrates aspects of
the IUdatapath to illustrate how the CP synthesizer allocates
the shift network. In what follows, we assume that the
various quanta in the IR, the ODR, the FIFO, and the cache
are logically numbered sequentially starting from 0 as
shown in FIG. 24.

As explained above, the shift network has a multiplexor
for each quantum in the instruction register numbered 0
through $W_{IR} - 1$. In the following discussion, k represents the
number of a given multiplexor ($0 \le k \le W_{IR} - 1$).

Each quantum multiplexor k selects among all quanta
between the following two extremes:

1) $k + W_{Imin}$ (last inst. was minimum size); and

2) $k + W_A + W_{IR} - 1$ (last inst. was maximum size and all of
ODR was consumed).

The CP synthesizer creates instances for each multiplexor
with enough input ports to select among the number of
quanta reflected above. This number is $(k + W_A + W_{IR} - 1) - (k + W_{Imin}) + 1 = W_A + W_{IR} - W_{Imin}$.
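
The bit counts and multiplexor sizes used by the CP synthesizer reduce to a few ceiling-log expressions. The sketch below is a minimal illustration, assuming the readings #W_curr bits = ceil(log2(W_Imax + 1)) and #IRmux bits = ceil(log2(W_A + W_Imax - W_Imin)) from Section 6.8.3.1, and an inclusive count of quanta between the two extremes listed above. All function names are hypothetical.

```python
def ceil_log2(n: int) -> int:
    """Smallest b such that 2**b >= n, for n >= 1."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return (n - 1).bit_length()


def wcurr_bits(w_imax: int) -> int:
    """Bits needed to encode an instruction length in the range 0..W_Imax."""
    return ceil_log2(w_imax + 1)


def ir_mux_inputs(w_a: int, w_ir: int, w_imin: int) -> int:
    """Quanta selectable by one IR multiplexor, counted inclusively between
    k + W_Imin and k + W_A + W_IR - 1 (the k terms cancel)."""
    return (w_a + w_ir - 1) - w_imin + 1


def ir_mux_select_bits(w_a: int, w_imax: int, w_imin: int) -> int:
    """Select bits per IR multiplexor: ceil(log2(W_A + W_Imax - W_Imin))."""
    return ceil_log2(w_a + w_imax - w_imin)
```

For instance, with $W_A = 8$, $W_{IR} = W_{Imax} = 4$ and $W_{Imin} = 1$ quanta, each multiplexor has 11 inputs and needs 4 select bits, while the current instruction width needs 3 bits.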

The choices for IU selection control for a quantum mux
k are given by:

1) $k + W_{curr}$ (sequential access and $k + W_{curr} < W_{IR}$);

2) $k + W_{curr} + W_{consumed}$ from ODR/FIFO (sequential
access and $k + W_{curr} \ge W_{IR}$);

3) $k + W_{IR} + P_{target}$ from ODR/FIFO (branch target access).

The choices for IU selection control for ODR/FIFO
quantum k are given by:

1) $k + W_A$ from FIFO (advance FIFO by a full packet);

2) $(k - W_{IR}) \bmod W_A$ from I-Cache output (load directly from
I-Cache); and

3) no shift (disable ODR load/FIFO advance).

The CP synthesizer generates the IU Control logic to
control the shift network according to the constraints given
above. The design of the IU Control logic is discussed
below.

6.8.3.5 Building IU Control Logic

The instruction fetch protocol described above is imple- 
mented in control logic that keeps track of the packet 
inventory — the packets in flight, packets in the prefetch 
buffer, and the unconsumed part of the ODR. It also issues 
instruction cache fetch requests, FIFO load and advance 
requests, and an ODR load request at the appropriate times, 
and provides the appropriate selection control for the shift 
and align network and other multiplexors in the instruction 
pipeline. Finally, the control logic is also responsible for 
flushing or stalling the pipeline upon request from the 
sequencer due to a branch or an interrupt. 

The control logic is expressed in the following 
pseudocode. 

Pseudocode for IU Control Logic

Module IUControl(cachePktRdy, flushPipe, EOP: in boolean; W_curr: in integer)
 1:  // Design time constants: pktSize (W_A), invSize (ceil(T_A * W_Imax / W_A))
 2:  // Internal state: numFIFOPkts(0), numCachePkts(0), W_consumed(W_A)
 3:  if (numFIFOPkts + numCachePkts < invSize) then
 4:    Request I-Cache fetch;                    // launch fetches to keep
 5:    numCachePkts++;                           // inventory constant
 6:  endif
 7:  if (cachePktRdy) then                       // packets are ready T_A
 8:    numCachePkts--;                           // cycles later
 9:    if (W_consumed >= W_A && numFIFOPkts == 0) then
10:      Load cachePkt into ODR;                 // put pkt directly into
11:                                              // ODR, if empty
12:    else                                      // otherwise, save pkt in
13:      Load cachePkt into FIFO;                // FIFO
14:      numFIFOPkts++;
15:    endif
16:  endif
17:  if (W_consumed >= W_A && numFIFOPkts > 0) then   // draw next pkt from FIFO
18:    Load FIFOPkt into ODR;
19:
20:    advance FIFO;
21:    numFIFOPkts--;
22:  endif
23:  if (flushPipe) then                         // branch or interrupt
24:    flush I-cache and FIFO;                   // processing
25:    numCachePkts = 0;
26:    numFIFOPkts = 0;
27:    W_consumed = W_A;
28:  elseif (EOP) then                           // skip to end-of-packet
29:    Shift IR to align to next packet boundary;
30:
31:  else                                        // shift to next
32:    Shift IR by W_curr;                       // instruction
33:    adjust W_consumed;
34:  endif


The control logic is expressed as pseudocode that consists
of a sequence of conditions and various actions to be
performed under those conditions. The logic keeps track of
the inventory of packets internally including those in flight
in the instruction cache pipeline (numCachePkts) and those
sitting in the prefetch buffer (numFIFOPkts). This is used to
issue a fetch request whenever the inventory size falls below
the threshold (line 3). The corresponding instruction packet
is ready to be read at the output of the cache $T_A$ cycles after
the fetch is initiated (line 7). This packet may be loaded
directly into the ODR if the rest of the pipeline is empty (line
9), or it may be saved in the FIFO (line 12). These packets
are later loaded into the ODR as needed (line 17).
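
The inventory bookkeeping in the pseudocode can be exercised with a small cycle-level model. The sketch below is a simplified illustration and not the synthesized FSM: it tracks only the counters and reports the actions taken each cycle. Several details are assumptions rather than statements from the pseudocode: the consumed width is reset to zero whenever a packet is loaded into the ODR, it is set to the packet size on an end-of-packet skip, and "adjust W_consumed" is modeled as a clamped addition of the current instruction width. The class and action names are hypothetical.

```python
class IUControlModel:
    """Counter-level model of the IU Control pseudocode (one step per cycle)."""

    def __init__(self, pkt_size: int, inv_size: int):
        self.pkt_size = pkt_size        # W_A, packet width in quanta
        self.inv_size = inv_size        # target inventory of packets
        self.num_fifo_pkts = 0
        self.num_cache_pkts = 0
        self.w_consumed = pkt_size      # ODR starts out fully consumed (empty)

    def step(self, cache_pkt_rdy, flush_pipe, eop, w_curr):
        actions = []
        # Lines 3-6: keep the packet inventory at its target size.
        if self.num_fifo_pkts + self.num_cache_pkts < self.inv_size:
            actions.append("fetch")
            self.num_cache_pkts += 1
        # Lines 7-16: route an arriving packet to the ODR or the FIFO.
        if cache_pkt_rdy:
            self.num_cache_pkts -= 1
            if self.w_consumed >= self.pkt_size and self.num_fifo_pkts == 0:
                actions.append("cache->ODR")
                self.w_consumed = 0     # assumption: a fresh packet is unconsumed
            else:
                actions.append("cache->FIFO")
                self.num_fifo_pkts += 1
        # Lines 17-22: refill an exhausted ODR from the FIFO.
        if self.w_consumed >= self.pkt_size and self.num_fifo_pkts > 0:
            actions.append("FIFO->ODR")
            self.w_consumed = 0         # assumption, as above
            self.num_fifo_pkts -= 1
        # Lines 23-34: flush, skip to the packet boundary, or shift normally.
        if flush_pipe:
            actions.append("flush")
            self.num_cache_pkts = 0
            self.num_fifo_pkts = 0
            self.w_consumed = self.pkt_size
        elif eop:
            actions.append("shift-to-packet-boundary")
            self.w_consumed = self.pkt_size   # assumption: rest of packet skipped
        else:
            actions.append(f"shift-{w_curr}")
            # Assumption: "adjust W_consumed" adds w_curr, clamped to the packet.
            self.w_consumed = min(self.pkt_size, self.w_consumed + w_curr)
        return actions
```

Stepping the model with no packets ready shows fetches being launched until the inventory reaches its target, after which an arriving packet goes straight to the empty ODR and later arrivals are queued in the FIFO.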

Upon encountering a taken branch signal or an interrupt
signal from the sequencer (flushPipe), the control logic
flushes the instruction pipeline by resetting the internal state
(line 23). This enables the pipeline to start fetching instructions
from the new address from the next cycle. Otherwise,
the next instruction in sequence needs to be aligned into the
instruction register (line 28). If the end-of-packet (EOP) bit
is set, the current packet residing in the ODR is considered
to be fully consumed and the IR is shifted to the next packet
available. Otherwise, the IR is shifted by the width of the
current instruction. In either case, the multiplexors of the
shift and alignment network in front of the IR are provided
with the appropriate selection control as described above.

The control logic shown above may be synthesized into a
finite-state machine (FSM) using standard synthesis tools
that translate a functional description such as that given
above and produce a concrete implementation in terms of
gates or PLA logic along with control registers to keep track
of the sequential state.

While we have illustrated a specific control path protocol,
it is important to note that the control path synthesizer
program can be adapted for a variety of different protocols.
Both the structural and procedural aspects of the protocol
may vary. The protocol may specify that the alignment
network is positioned between the instruction register and
the decode logic. In this protocol, for example, the instruction
register has a wider width (e.g., a width of one packet)
and the alignment network routes varying width instructions
from the instruction register to the decode logic. This
protocol is based on a procedural model of "in-place"
decoding, where instructions are not aligned in the IR, but
rather, fall into varying locations in the IR. The protocol
procedure defines a methodology to determine the start of
the next instruction to be issued from the IR.

The procedural model may be based on a statistical policy
where the width of the control path pipeline is optimized
based on the width of the templates in the instruction format.
In this approach, the control path designer minimizes the
width of the pipeline within some performance constraint.
For example, the width is allowed to be smaller than the
widest instruction or instructions as long as the stall cycles
needed to issue these instructions do not adversely impact
overall performance. When the width of the pipeline is less
than the widest instruction, one or more stall cycles may be
necessary to issue the instruction to the decode logic.
Performance is estimated based on the time required to issue
each instruction and the corresponding frequency of the
instruction's issuance.

6.9 Generating a Structural Description

The system produces a structural description of the processor
hardware at the RTL level in a standard hardware
description language such as VHDL. This description can be
linked with the respective HDL component libraries pointed
to by the macrocell database and processed further for
hardware synthesis and simulation.

7.0 Implementation Examples

The following sections provide a specific example of how
the VLIW design system detailed above is used to perform
design space exploration.

7.1 Application Characterization

Before exploring the VLIW design space, the system
begins by characterizing the application program by performing
the following steps:

1. A reference VLIW is constructed and the application is
compiled onto it.

2. A histogram of all the literal values in the program is
built. This histogram is later used to help optimize the
instruction format design.

3. The dynamic and static opcode usage is measured.

4. A table is constructed of (frequency, critical path
length) data for each exit from a hyperblock within the
program. This table can be used during the walk to estimate
the performance of machines that have not yet been evaluated.

5. The application is partially compiled and simulated 
using the re-targetable compiler to produce an intermediate 
representation (IR) annotated with execution statistics. The 
transformations performed by the compiler are independent 
of the VLIW parameters, except for predication and specu- 
lation. This phase is performed only four times for an 
application, once for each combination of (predication, 
speculation). 

7.2 VLIW Specification and Synthesis 

Candidate VLIW processors may be specified using an 
abstract ISA specification as described in Section 6. 

The following list provides VLIW parameters used to 
specify a candidate processor. 

1. Predication: A predication parameter indicates
whether predication is supported by both hardware and
compiler, or by neither.

2. Speculation: A speculation parameter indicates
whether speculation is supported by both hardware and
compiler, or by neither.

3. Registers: A register file specification specifies the
type of registers in the candidate processor, and the number
and size (i.e. number of registers in the file) of each type.

4. Functional Units: The functional units are selected
from a macrocell library including functional units of
different types, e.g., integer, floating point, memory and
branch.

5. Literal widths: The input specification indicates the
type of literals, e.g., memory, branch, and integer data
literals, and their widths.
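As an illustration only, the five parameters listed above could be collected in a small record such as the following Python sketch. The class names, field names and example values are hypothetical and are not taken from the system described here. The helper at the end enumerates the four (predication, speculation) combinations mentioned in the application characterization step.

```python
from dataclasses import dataclass, field
from itertools import product


@dataclass
class RegisterFileSpec:
    kind: str    # register type, e.g. "integer" or "floating point"
    count: int   # number of register files of this type
    size: int    # number of registers in each file


@dataclass
class VliwSpec:
    predication: bool   # supported by both hardware and compiler, or by neither
    speculation: bool   # supported by both hardware and compiler, or by neither
    register_files: list = field(default_factory=list)
    functional_units: dict = field(default_factory=dict)  # unit type -> instances
    literal_widths: dict = field(default_factory=dict)    # literal type -> bits


def characterization_variants():
    """All (predication, speculation) settings: four in total."""
    return list(product([False, True], repeat=2))


# Hypothetical candidate processor used only as an example.
spec = VliwSpec(
    predication=True,
    speculation=False,
    register_files=[RegisterFileSpec("integer", 1, 32)],
    functional_units={"integer": 2, "floating point": 1, "memory": 1, "branch": 1},
    literal_widths={"memory": 16, "branch": 24, "integer": 32},
)
```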

The VLIW processor may be designed using either
homogeneous or heterogeneous functional units. Each
instance of a homogeneous functional unit is fully functional
and is identical to all other instances of that type on a given
VLIW processor. A heterogeneous functional unit instance is
custom created to contain only a subset of all possible
operations associated with that type.

Constructing a set of functional units for a heterogeneous
VLIW relies upon the dynamic opcode statistics generated
by the application characterization step. The following
pseudo-code illustrates the software used to add a functional
unit of a given type (such as a floating point unit) to a VLIW:

addFunctionUnit(vliw) {
    for each opgroup
        tally instances of that opgroup in existing functional
        units in the vliw
    find opgroup with largest neediness
    threshold = largest_neediness * .75
    create an empty functional unit
    for each opgroup type
        if neediness(opgroup type) > threshold
            add opgroup to functional unit
    add functional unit to vliw
}

neediness(opgroup) {
    if dynamic_usage(opgroup) > 0 && instances(opgroup) == 0
        return infinity
    else
        return dynamic_usage(opgroup) / (instances(opgroup) + 1)
}

Note that if an application program never uses any of the
functionality within an opgroup, that opgroup never gets
instantiated. Otherwise, the number of instances of an
opgroup is roughly proportional to the dynamic usage of the
operations implemented by that opgroup.
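The neediness heuristic can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the actual implementation: a VLIW is modeled as a list of functional units, each a set of opgroup names, and the dynamic usage counts are a plain dictionary. One detail is an assumption of this sketch: when the largest neediness is infinite, the threshold is also infinite, so the sketch selects every opgroup with infinite neediness rather than applying the strict comparison.

```python
import math


def neediness(opgroup, dynamic_usage, instances):
    """Usage per existing instance (plus one); infinite if used but absent."""
    used = dynamic_usage.get(opgroup, 0)
    have = instances.get(opgroup, 0)
    if used > 0 and have == 0:
        return math.inf
    return used / (have + 1)


def add_functional_unit(vliw, opgroups, dynamic_usage):
    """Append one new functional unit (a set of opgroups) to vliw."""
    # Tally how many existing functional units implement each opgroup.
    instances = {g: sum(g in fu for fu in vliw) for g in opgroups}
    need = {g: neediness(g, dynamic_usage, instances) for g in opgroups}
    largest = max(need.values())
    if largest == 0:
        return None  # no opgroup is used at all; nothing to add
    if math.isinf(largest):
        # Assumption: treat all used-but-missing opgroups as over threshold.
        unit = {g for g in opgroups if math.isinf(need[g])}
    else:
        threshold = largest * 0.75
        unit = {g for g in opgroups if need[g] > threshold}
    vliw.append(unit)
    return unit


# Hypothetical floating point opgroups and usage counts.
example_usage = {"fadd": 100, "fmul": 30, "fdiv": 0}
example_vliw = []
add_functional_unit(example_vliw, ["fadd", "fmul", "fdiv"], example_usage)
add_functional_unit(example_vliw, ["fadd", "fmul", "fdiv"], example_usage)
```

In this example the first unit receives both used opgroups, the second receives only the more heavily used one, and the unused opgroup is never instantiated.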

7.3 VLIW Evaluation 


7.3.1 Performance 

The performance of an application on a VLIW processor
may be evaluated by compiling it onto the VLIW and then
simulating its execution. This is done in several phases:
Phase 1

The desired VLIW architecture is specified and synthesized
with an unoptimized instruction format.
Phase 2

The appropriate intermediate representation produced by
the application characterization step (depending on the
desired values of predication and speculation) is compiled
with the re-targeted compiler onto the desired VLIW target
architecture.
Phase 3

This optional phase (enabled or disabled by the spacewalker
user) takes the output of phase 2, creates an
optimized instruction format, and resynthesizes the VLIW
so that the correct instruction decode logic is synthesized.
This generally produces a VLIW that consumes more
area than the original, but this is counterbalanced by the fact
that the new VLIW requires less code size (and hence less
ROM area) because of the optimized encoding.
Phase 4

The compiled application is assembled and linked to
determine the application's code size (and ROM area) as
well as an estimate of the number of cycles needed to
execute it (assuming perfect caches that never miss).

To speed up the evaluation process, it is possible to
replace phases 2, 3 and 4 with a much faster estimation
phase that produces a rougher estimate of the number of
cycles needed to execute the application. This is done by
invoking the re-targeted compiler with a special flag that
causes it to generate a table of the resource bound path
lengths (rbpl) for each hyperblock exit in the program.
Performance is then estimated from this table along with the
(frequency, critical path length) table built during application
characterization using the following method:




    cycles += frequency(exit) * max(critical_path_length(exit), rbpl(exit))
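The accumulation above can be written out as a short Python sketch. The table layouts are assumptions made for illustration: the characterization table maps each hyperblock exit to a (frequency, critical path length) pair, and the rbpl table maps the same exits to resource bound path lengths.

```python
def estimate_cycles(exit_table, rbpl):
    """Sum frequency * max(critical path length, rbpl) over all exits."""
    cycles = 0
    for exit_id, (frequency, critical_path_length) in exit_table.items():
        cycles += frequency * max(critical_path_length, rbpl[exit_id])
    return cycles


# Hypothetical tables for a program with two hyperblock exits.
exit_table = {"exit0": (100, 5), "exit1": (10, 12)}
rbpl_table = {"exit0": 7, "exit1": 9}
total = estimate_cycles(exit_table, rbpl_table)  # 100*7 + 10*12 = 820
```

Only the rbpl table depends on the candidate processor, so the characterization table can be reused for every machine visited during the walk.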



The performance of a candidate processor may also be 
evaluated using the following approach. 
Phase 1 

A code simulator conducts a full simulation of an application
program to determine how many times each basic
block is visited. This information indicates how many times
the application uses abstract operations in the basic block. In
this context, abstract operations refer to operations that need
not be assigned to specific functional units, or to instructions
in the instruction format of a candidate processor.
Phase 2 

The scheduler and performance evaluator modules use the 
MDES generated for a candidate processor to map the 
abstract operations to physical resources in the candidate 
design (e.g., functional units) and to calculate an estimate of 
the execution time of the program. 

The first phase need only be done once for the application
program. The code simulator employs a sequential model to
simulate the execution of the program. The simulator
enumerates all of the basic blocks in the program and the
number of times each is visited during execution of the
program. This provides a summary of the usage of each
abstract operation.

During design space exploration, the evaluation routine
scans the MDES and maps each abstract operation to the
physical resources it uses during execution. In particular, the
scheduler in the re-targetable compiler uses the MDES to
map abstract operations to physical processor resources.
With this mapping, the scheduler can provide the execution
time for each basic block. Knowing the execution time and
the number of visits, the performance evaluator provides an
estimate of the execution time of the program.

Another way to evaluate performance is to synthesize a 
candidate processor (including its MDES), then compile the 
program using the MDES to generate machine-specific 
code. After generating this code, a code simulator performs 
a full simulation of the machine specific code to determine 
the execution time of the program. 

Yet another way to evaluate performance, as alluded to 
above, is to estimate performance based on the number of 
visits to a basic block, the critical path length, and the 
resource bound path length of the basic block. For example, 
the performance may be estimated as above by summing the
number of block visits multiplied by the greater of the resource
bound path length and critical path length for each block.

In each of these techniques, the performance evaluator
may optionally use an estimate of the memory performance,
e.g., an estimate of the number of stall cycles by estimating
the number of memory references that result in a cache miss
and multiplying that number by the average number of stalls
caused by each miss.
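A rough sketch of the two-phase estimate with the optional memory term might look like the following Python function. The argument names and data layout are hypothetical: visit counts come from the one-time simulation, per-block schedule lengths come from scheduling against the MDES of a candidate, and the miss count and stalls per miss are supplied by whatever memory model is available.

```python
def estimate_execution_cycles(visits, schedule_cycles,
                              cache_misses=0, stalls_per_miss=0.0):
    """Visits times per-block schedule length, plus estimated stall cycles."""
    compute_cycles = sum(count * schedule_cycles[block]
                         for block, count in visits.items())
    stall_cycles = cache_misses * stalls_per_miss
    return compute_cycles + stall_cycles


# Hypothetical profile: two basic blocks, machine-independent visit counts.
block_visits = {"bb0": 1000, "bb1": 50}
# Hypothetical schedule lengths for one candidate processor.
block_cycles = {"bb0": 6, "bb1": 10}

no_memory = estimate_execution_cycles(block_visits, block_cycles)
with_memory = estimate_execution_cycles(block_visits, block_cycles,
                                        cache_misses=20, stalls_per_miss=10)
```

Because the visit counts do not depend on the candidate processor, only the schedule lengths need to be recomputed for each new design.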

7.4 VLIW Walking Heuristics 

The process of exploring a design space involves the use
of a search procedure to efficiently select candidate processors
for evaluation. The search procedure begins with one or
more initial candidate processors and then attempts to find
other candidate processors that are at least as good as the
seed candidate or candidates relative to the evaluation
criteria.

The following pseudo code provides an example search
procedure for finding a set of Pareto processor designs.

MAIN_FUNCTION {
    Define SEED: a set of one or more initial processors from
        which the search procedure begins.
    Define PARETO: the current set of "best" processors.
        This set is initially empty.
    Define CANDIDATES: a set of one or more processors
        which need to be searched
    Define NEIGHBORS: the set of promising neighbors of a
        given processor

    CANDIDATES = SEED
    While CANDIDATES is not empty do {
        remove a candidate processor C from CANDIDATES
        if (C has already been explored) break
        evaluate the cost and performance of C
        mark C as already explored
        if (C is Pareto when tested against all processors in
            PARETO) {
            eliminate processors in PARETO which are eclipsed
                by C
            add C to PARETO
            NEIGHBORS = FIND_NEIGHBORS(C)
            add NEIGHBORS into CANDIDATES
            break
        }
        else { /* C is not Pareto */
            break
        }
    }
    FINAL_RESULT_SET = PARETO
    stop
}

The search procedure above repetitively refines a set of 
Pareto processors during a walk of the design space. The 
direction of the walk through the design space depends on 
the procedure for selecting neighbor processors. 
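The search loop can be sketched in Python as shown below. Several points are assumptions of this sketch rather than details given in the text: each design is a hashable value, evaluation returns a (cost, cycles) pair where lower is better for both, and each "break" in the pseudo code is read as "move on to the next candidate" rather than "leave the loop".

```python
def eclipses(a, b):
    """True if score a is at least as good as b in both metrics and not equal."""
    return a[0] <= b[0] and a[1] <= b[1] and a != b


def pareto_search(seed, evaluate, find_neighbors):
    """Return {design: (cost, cycles)} for the Pareto designs found."""
    pareto = {}
    explored = set()
    candidates = list(seed)
    while candidates:
        c = candidates.pop()
        if c in explored:
            continue
        score = evaluate(c)
        explored.add(c)
        if any(eclipses(s, score) for s in pareto.values()):
            continue  # C is not Pareto
        # Drop designs eclipsed by C, then record C and queue its neighbors.
        pareto = {d: s for d, s in pareto.items() if not eclipses(score, s)}
        pareto[c] = score
        candidates.extend(find_neighbors(c))
    return pareto


# Hypothetical one-parameter design space: n functional units, n from 1 to 5.
def toy_evaluate(n):
    return (n, -(-12 // n))  # cost n, cycles ceil(12 / n)


def toy_neighbors(n):
    return [n + 1] if n < 5 else []


result = pareto_search([1], toy_evaluate, toy_neighbors)
```

In the toy space, the five-unit design costs more than the four-unit design without reducing cycles, so it is evaluated but left out of the Pareto set.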

The following pseudo code illustrates examples of such
procedures. One approach (ascending) starts with an inexpensive
seed processor and adds resources (RES, or RES1,
RES2 . . . ) to improve performance. In this approach, the
procedure adds concurrency sets or removes exclusion sets
to improve ILP, and thereby improve performance. Another
approach (descending) starts with an expensive processor
and removes idle resources (RES, or RES1, RES2 . . . ). In
this approach, the procedure adds exclusion sets or removes
concurrency sets selectively to remove under-utilized
resources while reducing processor cost.

In the process of selecting neighbors, the search procedure
modifies the parameters of a prior candidate processor. These
parameters may be structural (e.g., adding/removing ports,
macrocells, etc.) or non-structural (adding/removing ILP
constraints such as exclusion and concurrency sets, adding/
removing operations, etc.). Macrocells may be eliminated to
reduce register file porting, exclusions may be added to
reduce porting, exclusions may be added to reduce instruction
width, etc. In short, a variety of parameter modifications
or selections can be made to define new candidate processors.
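As a rough illustration of utilization-driven neighbor selection, the Python sketch below proposes one new design per busy or idle resource. It rests on simplifying assumptions that are not part of the description: a design is a dictionary of integer parameters, each resource is controlled by exactly one parameter of the same name, and the threshold values are arbitrary.

```python
def find_neighbors(design, utilization, upper_threshold=0.9,
                   lower_threshold=0.2, ascending=True, descending=True,
                   step=1):
    """Return neighboring designs based on per-resource utilization."""
    result = []
    if ascending:  # upward search in cost: grow busy resources
        for res, used in utilization.items():
            if used > upper_threshold:
                result.append({**design, res: design[res] + step})
    if descending:  # downward search in cost: shrink idle resources
        for res, used in utilization.items():
            if used < lower_threshold and design[res] - step >= 0:
                result.append({**design, res: design[res] - step})
    return result


# Hypothetical candidate and measured utilization (fraction of busy cycles).
candidate = {"int_fu": 2, "fp_fu": 2, "mem_fu": 1}
measured = {"int_fu": 0.95, "fp_fu": 0.10, "mem_fu": 0.50}
neighbors = find_neighbors(candidate, measured)
```

Pairs or triplets of resources could be handled the same way by changing two or more parameters in a single new design.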

function RESULT = FIND_NEIGHBORS(C) {
    Define C: a processor whose neighbors need to be identified
    Define RESULT: a set of neighboring processors
        (ascending, descending or both)
    RESULT = empty set

    if (ASCENDING) { /* enables upward search in cost */
        identify all resources BUSYRES whose utilization
            exceeds UPPER_THRESHOLD
        for each resource RES in BUSYRES {
            identify parameter P which increases RES
            identify a processor CNEW which is like C but with
                increased P
            add CNEW to RESULT
        }
        /* one might also consider pairs, triplets, etc. of
           resources, e.g. */
        for pairs of resources (RES1, RES2) in BUSYRES {
            identify parameters P1, P2 which increase RES1 &
                RES2
            identify a processor CNEW which is like C but with
                increased P1 & P2
            add CNEW to RESULT
        }
    }

    if (DESCENDING) { /* enables downward search in cost */
        identify all resources IDLERES whose utilization is
            less than LOWER_THRESHOLD
        for each resource RES in IDLERES {
            identify parameter P which decreases RES
            identify a processor CNEW which is like C but with
                decreased P
            add CNEW to RESULT
        }
        /* one might also consider pairs, triplets, etc. of
           resources, e.g. */
        for pairs of resources (RES1, RES2) in IDLERES {
            identify parameters P1, P2 which decrease RES1 &
                RES2
            identify a processor CNEW which is like C but with
                decreased P1 & P2
            add CNEW to RESULT
        }
    }

    return RESULT
}

The following subsections describe a number of alternative
walking heuristics. In these descriptions, the following
terms are used:

k-neighbor. A k-neighbor of a VLIW machine is another
VLIW machine that has at least 1, and up to k,
parameters that are incrementally larger than in the first
machine. For example, if machine A has one more
integer functional unit than machine B, and the two
machines are otherwise identical, then A is a
1-neighbor of B (it is also a 2-neighbor,
3-neighbor, . . . ). For register files, "incrementally
larger" is with respect to a quantum not necessarily
equal to 1. For example, if the quantum for integer
register files were 8, machine A would be a 1-neighbor
of machine B if it had 8 more integer registers than B
and the machines were otherwise identical.

-k-neighbor. A machine that has at least 1, and up to k,
parameters that are incrementally smaller than another
machine.

candidates. A set of unevaluated VLIW machines that will
be evaluated during the course of a walk.
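The k-neighbor definition can be made concrete with a short Python sketch. The representation is an assumption made for illustration: a machine is a dictionary of integer parameters, and the per-parameter quantum defaults to 1 unless given. Passing sign=-1 yields the -k-neighbors.

```python
from itertools import combinations


def k_neighbors(machine, k, quantum=None, sign=1):
    """Machines with 1 to k parameters changed by one quantum each."""
    quantum = quantum or {}
    params = sorted(machine)
    result = []
    for n in range(1, k + 1):
        for subset in combinations(params, n):
            new = dict(machine)
            for p in subset:
                new[p] += sign * quantum.get(p, 1)
            if all(value >= 0 for value in new.values()):
                result.append(new)
    return result


# Hypothetical machine B with a register quantum of 8.
machine_b = {"int_fu": 1, "int_regs": 16}
one_neighbors = k_neighbors(machine_b, 1, quantum={"int_regs": 8})
two_neighbors = k_neighbors(machine_b, 2, quantum={"int_regs": 8})
```

Every 1-neighbor also appears among the 2-neighbors, which matches the remark in the definition above.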
Pareto Descent 

The pareto descent walk attempts to stay close to the 
pareto by confining its exploration to the neighborhoods of 
known pareto points. An example of the pseudo code for this 
walk is: 

Pareto Descent (local walk)
    candidates = cheapest VLIW
    loop {
        remove cheapest from candidates
        evaluate it
        if point is on pareto
            candidates += k-neighbors of point
    } until candidates is empty

Delft 

Named after research conducted at Delft, this heuristic
takes multiple sweeps across the design space, from cheap
machines to expensive and back again, putting more emphasis
on cost or on performance, depending on the
direction of the sweep. An example of the pseudo code for
this approach is:

Delft
    current = most expensive machine
    for (exponent = 1; exponent <= 3; exponent += 0.5) {
        do until current == NULL // reduce sweep
            current = -1-neighbor with better reduce-quality
        do until current == NULL // extend sweep
            current = 1-neighbor with better extend-quality
    }

    reduce-quality(machine) {
        return 1 / (cost(machine) * cycles(machine)^exponent)
    }

    extend-quality(machine) {
        return 1 / (cost(machine)^exponent * cycles(machine))
    }
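The two quality measures and a single sweep can be sketched in Python as follows. This is a simplified sketch under stated assumptions: the sweep moves to the best-scoring neighbor while one improves on the current machine, and it returns the last machine reached instead of NULL. The toy cost and cycle functions are hypothetical.

```python
def reduce_quality(cost, cycles, exponent):
    """Quality used on the reduce sweep: the exponent weights cycles."""
    return 1.0 / (cost * cycles ** exponent)


def extend_quality(cost, cycles, exponent):
    """Quality used on the extend sweep: the exponent weights cost."""
    return 1.0 / (cost ** exponent * cycles)


def sweep(start, neighbors, quality):
    """Follow neighbors with better quality until none improves."""
    current = start
    while True:
        better = [n for n in neighbors(current) if quality(n) > quality(current)]
        if not better:
            return current
        current = max(better, key=quality)


# Hypothetical one-parameter machines: n functional units, n from 1 to 6.
def toy_cost(n):
    return n


def toy_cycles(n):
    return 12 // n + 4


def smaller(n):  # the -1-neighbors of machine n
    return [n - 1] if n > 1 else []


end_exp1 = sweep(6, smaller, lambda n: reduce_quality(toy_cost(n), toy_cycles(n), 1))
end_exp3 = sweep(6, smaller, lambda n: reduce_quality(toy_cost(n), toy_cycles(n), 3))
```

With exponent 1 the reduce sweep runs all the way down to the cheapest machine, while with exponent 3 the heavier weight on cycles stops it after one step.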

Conjugate Gradient 

Another approach is to define an objective function for a 
candidate processor's cost, performance, or cost and 
performance, and then evaluate the gradient of this function 
for candidate processors to identify candidate processors for 
which the objective function is a local maximum or mini- 
mum. 

Conjugate Gradient
    candidate = cheapest VLIW
    gradient = ∇f(candidate)
    conjugate gradient = gradient
    loop {
        compute ∇f(candidate)
        update conjugate gradient
        candidate = best performing machine
            along conjugate gradient
    } until hit local minimum

Two examples of the objective functions include:

    f(machine) = cycles(machine)
    f(machine) = cost(machine) * cycles(machine)

As illustrated in the above examples of walking 
heuristics, there are a variety of alternative approaches for 
selecting candidate processors in a design space exploration 
process. 

8.0 Conclusion

The preceding sections describe a processor synthesis
system as well as methods for automated design space
exploration. As noted, there are many ways to implement the
system and methods. In the design space exploration
process, the search procedure specifies candidate processors
and evaluates them in an attempt to identify a processor
design or set of designs that are optimal for a particular
application program. The processor may be specified in
terms of an abstract non-structural ISA specification, a
structural specification, or a combination of structural and
non-structural parameters. The merits of a candidate processor
may be measured in terms of its cost and performance.
A variety of internal and external metrics may be
used to evaluate cost and performance, and these metrics are
not limited to chip area or processing time. Rather, they
extend to additional metrics such as power consumption,
circuit complexity, and resource utilization. In many cases,
it is preferable to evaluate a candidate processor without
synthesizing a detailed structural description of its datapath,
control path, or instruction format. In these cases, the
abstract instruction set architecture or a high level structural
description may be used to evaluate the merits of a candidate
processor.

Based on the evaluation of a candidate processor, the
system may add new candidates or remove previously
identified candidates. The latter makes the process more
efficient by excluding candidates from consideration,
thereby avoiding the processing that would otherwise be
required to evaluate the excluded candidates.

In view of the many possible implementations of the 
invention, it should be recognized that the implementation 
described above is only an example of the invention and 
should not be taken as a limitation on the scope of the 
invention. Rather, the scope of the invention is defined by 
the following claims. We therefore claim as our invention all 
that comes within the scope and spirit of these claims. 

We claim: 

1. A method for programmatic design of a VLIW proces- 
sor comprising: 

reading a specification for at least one candidate VLIW 
processor, where the specification describes a specific 
instance of a parameterized VLIW processor design; 
obtaining internal resource usage statistics for the candidate
VLIW processor, where the internal resource
usage statistics indicate how operations or hardware
components of the candidate VLIW processor are used
during execution of an application program on the 
candidate processor, wherein the internal resource 
usage statistics include data indicating costs of indi- 
vidual structural components; and 
using the internal resource usage statistics to provide one 
or more new candidate VLIW processor specifications 
or to exclude one or more previously identified candi- 
dates. 

2. The method of claim 1 including: 

repeating the method of claim 1 for two or more candidate 
processors to find an optimized candidate processor 
that satisfies a pre-determined design constraint; 

wherein the two or more candidate processors are selected 
programmatically from a parameterized space of pro- 
cessor designs. 

3. The method of claim 2 wherein the design constraint is 
VLIW processor cost measured in area occupied by hard- 
ware components in the description. 

4. The method of claim 2 wherein the design constraint is 
VLIW processor performance measured in execution time of 
an application program or set of application programs to be 
executed on the VLIW processor. 

5. The method of claim 1 wherein the specification is a 
structural description of the VLIW processor, including 
parameters describing instances of structural hardware com- 
ponents in the VLIW processor, and providing the new 
specification includes modifying, removing or adding a 
structural hardware component based on the internal 
resource usage statistics. 

6. The method of claim 1 wherein the specification is an
abstract non-structural specification of the candidate VLIW
processor, including parameters specifying processor
operations and instruction level parallelism constraints among the
specified processor operations; and

providing the new specification includes modifying,
removing or adding processor operations or instruction
level parallelism constraints based on the internal
resource usage statistics.

7. The method of claim 6 including: 
programmatically generating a processor datapath of the 

candidate processor from the specification, the datapath 
including register file ports and a correspondence 
between the ports and processor operations using the 
ports; 

obtaining operation issue statistics indicating how the 
application program uses the processor operations; 

using the operation issue statistics and correspondence
between the ports and processor operations, determining
utilization of the ports by the application program;
and

based on the utilization, modifying, removing or adding 
processor operations or instruction level parallelism 
constraints based on the internal resource usage statis- 
tics. 

8. The method of claim 6 including: 
programmatically generating a processor hardware 

description from the specification, including macrocell 
instances of hardware components and processor 
operations used by the macrocell instances; 
obtaining operation issue statistics indicating how the
application program uses the processor operations;
using the operation issue statistics, determining utilization
of the macrocell instances by the application program;
and

based on the utilization, modifying, removing or adding
processor operations or instruction level parallelism
constraints based on the internal resource usage statistics.

9. The method of claim 6 including: 

programmatically generating a processor instruction format
from the specification, including instruction templates
representing VLIW instructions having slots for
two or more concurrently issued operations and instruction
fields for the operations;
obtaining operation issue statistics indicating how the
application program uses the instruction fields;
using the operation issue statistics, determining utilization
of the instruction fields by the application program; and
based on the utilization, modifying, removing or adding
processor operations or instruction level parallelism
constraints based on the internal resource usage statistics.

10. The method of claim 1 wherein the internal resource 
usage statistics include data indicating frequency of usage of 
structural hardware components in the candidate VLIW 
processor; and 

providing the new specification includes deleting 
instances of rarely used components or adding 
instances of highly used components. 

11. The method of claim 1 wherein the internal resource 
usage statistics include data indicating frequency that two or 
more operations are used concurrently; and 

providing the new specification includes adding instruc- 
tion level parallelism constraints to the specification to 
prohibit selected operations from being issued concur- 
rently in the new candidate processor or adding instruc- 
tion level parallelism constraints to require that the new 
candidate processor be able to execute selected opera- 
tions concurrently. 

12. The method of claim 1 wherein the internal resource 
usage statistics include data indicating usage of registers in 
the candidate processor; and 

providing the new specification includes using the internal 
usage statistics to select a number of registers in the 
new candidate processor. 

13. The method of claim 1 including: 
providing the new specification includes using the internal 

resource usage statistics to identify a structural com- 
ponent having a higher cost and lower utilization than 
other components, and modifying or deleting the structural
component having higher cost and lower utilization.

14. The method of claim 1 wherein the specification 
comprises a non-structural processor parameterization 
including specified processor operations and instruction 
level parallelism constraints among the specified operations, 
and 

programmatically generating a description of the new
candidate VLIW processor, including:
programmatically generating a datapath of the new
candidate processor, including a hardware description
of functional units for executing the specified
operations according to the instruction level parallelism
constraints, register files and an interconnect
coupling data ports of the functional units and register
files.

15. The method of claim 14 including:
programmatically generating an instruction format
specification including instruction templates, instruction
fields for each template, and bit positions and encodings
for each instruction field.

16. The method of claim 14 including:

programmatically generating a description of control 
logic for issuing control signals to control ports of the 
functional units and register files in the datapath. 

17. A computer readable medium having software for 
performing the method of claim 1. 






